The Speed, Power, and Cost Benefits of AI on IBM Z

December 04, 2024

While AI tools continue to advance, the hardware to support them is barely keeping up. Today's industry-standard hardware for running AI workloads is the cloud-based GPU farm, in which x86-based processors host the GPU cards. The x86 platform has come a long way since it was introduced in 1978, but it has drawbacks. When scaled out, these farms use so much power that Microsoft has struck a deal to restart a nuclear power plant to help power its data centers. And when AI working sets are sent to the cloud for inference, the resulting network latency limits AI's potential to process thousands or millions of requests quickly.

While x86-based GPU farms will continue to be a valued resource, especially for AI model training, IBM now makes it possible to run large-scale AI inference on-premises with its IBM Z mainframe. On-site inference eliminates the network round trip associated with cloud-based inference and uses significantly less energy.

Real-Time, On-Premises Inference

The IBM Telum chip is the heart of the IBM Z mainframe. It can handle a massive number of AI inference requests per second with minimal latency, which makes it ideal for high-volume transaction processing. Organizations can analyze large volumes of data in real time and make instantaneous decisions on transactions, such as scoring them for fraud. The platform scales to meet growing demand and helps keep data private and protected because the data never leaves the premises.

Much of the latency in scoring transactions or running other AI workloads comes from the process of requesting cloud-based inference. An inference request to an x86-based GPU farm starts with bundling up the AI working set and sending it across a network to the cloud. IBM Telum eliminates that step: inference runs in real time on the on-premises mainframe, alongside the transaction data. As far as we know, IBM is currently the only company offering an alternative to GPU farms for on-premises, real-time inference.

Key Advantages of AI on IBM Z

Speed

AI workloads benefit from fast processor cores and plenty of memory. The Telum chip delivers both, with eight cores and a clock speed of 5.2 GHz. On-site inference on a fast chip now makes it possible to score 100% of transactions for fraud, a huge benefit to the credit card and banking industries. But this could be just the tip of the iceberg. Telum opens the door to new use cases that depend on lightning-fast inference to deliver value.

Frugal with Power

Dramatically lower power consumption is another benefit. The voracious power requirements of GPUs run counter to the push for more sustainable operations. And the savings can extend well beyond transaction scoring. Any organization hoping to lean hard into GenAI across the enterprise is likely to see significant power savings by running large language models on IBM Z and the Telum processor.

Guaranteed Privacy

Unlike most other GenAI providers, IBM has made a legal commitment not to use your data to train its models, so your proprietary information stays private. For example, running IBM watsonx Code Assistant on IBM Z to modernize COBOL applications means your proprietary code never leaves the data center.

Lower Software Costs

IBM Z also enables organizations to lower software licensing costs. Inference runs on the IBM z Integrated Information Processor (zIIP), and work performed on zIIP engines isn't counted when software charges are calculated, so shifting AI workloads to IBM Z can reduce costs.

A Compelling Roadmap

IBM continues to innovate on IBM Z to make it the platform of choice for running AI workloads. The company has laid out a clear roadmap for the near future and has been very open about its plans.

In the short term, we expect IBM to release its next-generation IBM Telum II processor. Telum II will run at 5.5 GHz while using the same amount of energy as the current Telum processor. It will also have 36MB of L2 cache per core to efficiently handle large databases.

IBM is also introducing its GPU-like Spyre accelerator card, which can be installed in clusters for fine-tuning AI workloads. Each Spyre card draws 75 watts of power, compared to the 1,000 watts needed to power a traditional GPU. Each card carries 128GB of LPDDR5 memory and delivers more than 300 TOPS (trillions of operations per second).
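To put the Spyre figures in perspective, the short sketch below simply restates the numbers quoted above (75 W per Spyre card, 1,000 W for a traditional GPU, 300+ TOPS) as ratios. It is an illustrative back-of-the-envelope calculation, not a benchmark; the 300 TOPS value is treated as a floor, and the function and constant names are hypothetical.

```python
# Figures quoted in the article; 300 TOPS is a lower bound ("300+").
SPYRE_WATTS = 75
GPU_WATTS = 1_000
SPYRE_TOPS = 300


def power_comparison(accel_watts: float, gpu_watts: float) -> tuple[float, float]:
    """Return (accelerator power as a fraction of GPU power, percent reduction)."""
    ratio = accel_watts / gpu_watts
    reduction_pct = (1 - ratio) * 100
    return ratio, reduction_pct


ratio, reduction = power_comparison(SPYRE_WATTS, GPU_WATTS)
# Lower-bound efficiency for Spyre, since 300 TOPS is a minimum figure.
tops_per_watt = SPYRE_TOPS / SPYRE_WATTS

print(f"Spyre draws {ratio:.1%} of a traditional GPU's power ({reduction:.1f}% less)")
print(f"At least {tops_per_watt:.1f} TOPS per watt")
```

The GPU side of this comparison is power draw only. Because no GPU TOPS figure is given here, these numbers alone can't show relative performance per watt between the two.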

And it doesn't stop there. IBM has moved the classic I/O channel subsystem onto an adjacent data processing unit (DPU) specialized for I/O acceleration. This reduces the energy consumed by the I/O channel subsystem by 70%, enabling customers to expand their mainframe capabilities without adding power consumption.

The additional speed and processing power of the Spyre card can help organizations raise confidence levels for their models. Confidence levels for today's fraud detection models running on the Telum processor range from about 75% to 90%. Spyre aims to improve those levels by fine-tuning AI models on-premises. By running the same data against a set of additional, more accurate models, the Spyre card can help organizations improve AI confidence for use cases that need it, such as a more granular analysis of each credit card transaction.

How Evolving Solutions Can Help

For most organizations, the value of AI comes from inference rather than training. We suggest taking a close look at how AI on IBM Z may be able to accelerate workloads and scale AI across your organization with fast inference, low power consumption, and guaranteed privacy.

Contact Evolving Solutions today to get started.

Jim Fyffe