Beijing, April 23, 2026 /PRNewswire/ -- Summary: Shanghai Kaiyun Digital Technology Co., Ltd. (hereinafter referred to as Kaiyun), in collaboration with IBM, has launched a comprehensive “Predict, Schedule, Control, Monitor” strategy for memory resource optimization, built on the IBM Spectrum LSF (hereinafter referred to as LSF) high-performance job scheduling and resource management platform. The initiative helps enterprises reduce costs and improve efficiency as computing costs surge.
Amid today’s volatile semiconductor supply chain, prices of core hardware such as server memory continue to rise. For enterprises that rely on high-performance computing (HPC), the old brute-force approach of “adding more hardware when it’s insufficient” is no longer sustainable. Under cost pressure, getting the most out of every existing memory module, rather than blindly purchasing new equipment, has become central to an enterprise’s core competitiveness.
AI to “Predict” the Real Needs of Jobs
When users submit jobs, they often adopt a “better safe than sorry” strategy and over-request memory because they cannot accurately estimate consumption. While this approach seems prudent, it leaves a large amount of memory requested but idle for long periods. Because that memory is reserved, the cluster cannot accept new jobs, and overall utilization stays low.
The LSF Predictor, combined with IBM watsonx’s machine learning capabilities, effectively solves this problem. The system automatically analyzes the characteristics of historical jobs (users, submission commands, input data, etc.) to train a high-precision prediction model. When a user submits a job again, the system intelligently predicts the memory resources and runtime required by the job, eliminating resource overestimation at the source and achieving a qualitative leap in cluster memory utilization.
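The idea of sizing a request from job history can be sketched in a few lines of Python. This is only an illustration and not the LSF Predictor's actual model, which is trained with machine learning: the function names, the (user, command) grouping key, and the 10% safety margin are all assumptions made for the example.

```python
from collections import defaultdict

def build_history(records):
    """Group observed peak memory (MB) by (user, command).

    records: iterable of (user, command, peak_mb) tuples (hypothetical format).
    """
    history = defaultdict(list)
    for user, command, peak_mb in records:
        history[(user, command)].append(peak_mb)
    return history

def predict_memory_mb(history, user, command, default_mb, margin_pct=10):
    """Suggest a memory request for a new job.

    Uses the largest peak seen for the same user and command plus a
    safety margin; falls back to default_mb when there is no history.
    """
    peaks = history.get((user, command))
    if not peaks:
        return default_mb
    # Integer arithmetic avoids floating-point rounding surprises.
    return max(peaks) * (100 + margin_pct) // 100

history = build_history([
    ("alice", "sim.sh", 3800),
    ("alice", "sim.sh", 4200),
    ("bob", "synth.sh", 900),
])
print(predict_memory_mb(history, "alice", "sim.sh", default_mb=16000))
```

Even this naive rule replaces a blanket 16,000 MB default with a request close to what the job has actually needed; a trained model would use far more job features than the two used here.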
The Path to HPC Breakthrough Amid Memory Surge – Kaiyun and IBM Join Forces to Deliver a “Combination Punch” for Computing Optimization
Simple scheduling strategies can easily lead to memory fragmentation, where large jobs cannot enter and small jobs cannot fill the gaps, keeping cluster utilization at a low level. It’s like a poorly arranged game of Tetris that requires intelligent orchestration to utilize space reasonably and reduce resource waste.
Kaiyun leverages LSF’s efficient scheduling algorithms to achieve “granular-level” control over memory resources. Taking the backfill scheduling mechanism as an example: when the system reserves memory for a high-priority large job, the scheduler automatically finds time gaps to run short jobs during the waiting period, ensuring memory remains fully utilized. Meanwhile, affinity scheduling ensures that computing cores prioritize accessing the nearest local memory, shortening memory occupancy cycles by improving job execution speed, thereby increasing resource turnover. There are many similar scheduling strategies in LSF that enhance memory utilization. Based on LSF, Kaiyun has accumulated rich practical experience.
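The backfill idea described above can be illustrated with a small Python sketch. It is a simplified model under stated assumptions, not LSF's scheduler: a single node, known job runtimes, and a hypothetical `Job` record and `backfill` function invented for the example.

```python
from dataclasses import dataclass

@dataclass
class Job:
    name: str
    mem_gb: int
    runtime_min: int

def backfill(free_mem_gb, minutes_until_reserved_start, queue):
    """Select queued jobs that can run without delaying the reserved job.

    A job qualifies if it fits in the memory that is free right now and
    finishes before the high-priority job's reservation begins.
    """
    selected = []
    for job in queue:  # visit jobs in queue order
        fits = job.mem_gb <= free_mem_gb
        finishes_in_time = job.runtime_min <= minutes_until_reserved_start
        if fits and finishes_in_time:
            selected.append(job.name)
            free_mem_gb -= job.mem_gb  # memory is now taken by this job
    return selected

queue = [Job("a", 32, 20), Job("b", 16, 45), Job("c", 40, 10), Job("d", 24, 30)]
print(backfill(free_mem_gb=64, minutes_until_reserved_start=30, queue=queue))
```

Here job "b" is skipped because it would run past the reservation, and "c" is skipped because it no longer fits once "a" has taken 32 GB, so the 30-minute gap is filled by "a" and "d".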
Certain abnormal jobs or programs with memory leaks may consume all resources of a server, causing system crashes and affecting other jobs. To address this, LSF provides multi-dimensional mechanisms to ensure reasonable memory resource utilization.
LSF offers two types of memory limit strategies: soft limits and hard limits. A soft limit acts as a “warning line,” where the system tries to keep job memory consumption within this range but allows brief exceedances for some buffer. A hard limit, on the other hand, is an insurmountable “red line.” Once a job hits it, LSF immediately terminates it to prevent a single job from bringing down the entire node. At the same time, LSF can deeply integrate with Linux container technologies to build a multi-layered memory protection system for each job, ensuring overall cluster stability. Additionally, the dynamic preemption mechanism allows core business to “borrow” memory from low-priority jobs during resource shortages, ensuring critical tasks run first. Through this combination of “soft and hard” strategies, the cluster can maximize effective memory utilization while maintaining stable operation.
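The difference between the two limits can be sketched as follows. This is a toy model and not LSF's enforcement code: the function names, the action labels, and the sampled usage values are assumptions chosen for illustration.

```python
def check_memory(usage_mb, soft_limit_mb, hard_limit_mb):
    """Classify one memory usage sample against the two limits."""
    if usage_mb >= hard_limit_mb:
        return "terminate"  # red line: the job is stopped at once
    if usage_mb > soft_limit_mb:
        return "warn"       # warning line: exceedance is tolerated
    return "ok"

def enforce(samples_mb, soft_limit_mb, hard_limit_mb):
    """Walk through usage samples and stop at the first hard-limit hit."""
    actions = []
    for usage in samples_mb:
        action = check_memory(usage, soft_limit_mb, hard_limit_mb)
        actions.append(action)
        if action == "terminate":
            break  # later samples never happen: the job is gone
    return actions

print(enforce([3000, 4200, 3900, 5100, 2000], soft_limit_mb=4000, hard_limit_mb=5000))
```

The brief spike to 4,200 MB only triggers a warning and the job carries on, while the sample at 5,100 MB ends the job before it can exhaust the node.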
Real-time “Monitoring” Leaves No Room for Waste
Without detailed monitoring, administrators often find it difficult to intuitively identify which jobs in the cluster are consuming large amounts of resources with little actual computing contribution, leaving optimization efforts without clear data support.
With the LSF monitoring platform, the system can identify in real time jobs that request large amounts of memory but carry extremely low loads, and it automatically generates detailed resource consumption reports broken down by department, project group, user, and other dimensions. In addition, the Kaiyun ICP Intelligent Computing Platform, built on IBM LSF as its core underlying engine, further integrates scheduling, monitoring, analysis, and optimization, providing enterprises with full lifecycle management from computing resource allocation to resource optimization.
These reports clearly show the actual usage efficiency of each resource segment, helping administrators quickly pinpoint sources of waste and promptly reclaim idle memory. At the same time, this data provides an objective basis for adjusting daily scheduling strategies and builds a scientific decision-making loop for future hardware procurement, cluster expansion, or architecture optimization, ensuring every resource investment is traceable and data-driven.
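A report of this kind can be approximated with a short Python sketch. The record fields, the 32 GB request threshold, and the 25% usage cutoff are hypothetical choices made for the example; they do not describe the LSF monitoring platform's data model or its criteria.

```python
from collections import defaultdict

def waste_report(jobs, min_request_gb=32, max_used_fraction=0.25):
    """Summarize memory per department and flag over-requesting jobs.

    A job is flagged when it requested at least min_request_gb but its
    peak usage stayed at or below max_used_fraction of that request.
    """
    report = defaultdict(
        lambda: {"requested_gb": 0, "peak_used_gb": 0, "flagged": []}
    )
    for job in jobs:
        entry = report[job["department"]]
        entry["requested_gb"] += job["requested_gb"]
        entry["peak_used_gb"] += job["peak_used_gb"]
        big_request = job["requested_gb"] >= min_request_gb
        low_usage = job["peak_used_gb"] <= max_used_fraction * job["requested_gb"]
        if big_request and low_usage:
            entry["flagged"].append(job["job_id"])
    return dict(report)

jobs = [
    {"job_id": 101, "department": "design", "requested_gb": 64, "peak_used_gb": 8},
    {"job_id": 102, "department": "design", "requested_gb": 16, "peak_used_gb": 12},
    {"job_id": 103, "department": "verification", "requested_gb": 128, "peak_used_gb": 100},
]
for department, summary in waste_report(jobs).items():
    print(department, summary)
```

The same grouping could be keyed by project group or user instead of department, which is how one table of job records yields each of the report dimensions mentioned above.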
Take a leading domestic chip design company as an example. This client faced severe memory resource waste in EDA simulation scenarios, with overall cluster memory utilization consistently below 50%. Even with continuous hardware expansion, job queuing remained a serious issue.
Kaiyun implemented the LSF-based “Predict, Schedule, Control, Monitor” strategy to build a precise memory resource management system for this client. After deployment, cluster memory utilization increased to over 78%, and the average job waiting time was reduced by more than 30%. This effectively freed up computing capacity equivalent to dozens of servers without adding new hardware, saving the client millions of yuan in hardware procurement costs annually.
New LSF Version Enables More Precise Memory Management
In response to users’ growing focus on memory utilization, the upcoming version of LSF will introduce a memory reporting feature that significantly enhances the ability to track job memory usage. At the job level, the feature shows memory data such as requested memory, actual peak and average usage, and swap usage. It also provides derived metrics, including memory waste or shortage, usage pressure, risk levels, and a comparison of peak and average values, and it supports a runtime-weighted calculation of overall memory usage efficiency. In the summary overview, users can see average memory usage, the reasonableness of job requests, risk distribution, and cumulative figures for overall memory reservation, usage, waste, and shortage, which together support a comprehensive assessment of cluster memory utilization.
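The release does not give the formulas behind these metrics, so the following Python sketch shows only one plausible reading. The definitions of waste, shortage, and runtime-weighted efficiency used here are assumptions, as are the function names; the new LSF version may define them differently.

```python
def job_metrics(requested_mb, peak_mb, avg_mb):
    """Derive per-job metrics from requested, peak, and average memory."""
    return {
        # Assumed definition: memory requested but never touched.
        "waste_mb": max(requested_mb - peak_mb, 0),
        # Assumed definition: amount by which the peak exceeded the request.
        "shortage_mb": max(peak_mb - requested_mb, 0),
        # Peak vs. average; None when the average is zero.
        "peak_to_avg": peak_mb / avg_mb if avg_mb > 0 else None,
    }

def weighted_efficiency(jobs):
    """Runtime-weighted efficiency across a list of jobs.

    jobs: list of (requested_mb, avg_mb, runtime_s) tuples. Long-running
    jobs count for more, since they hold their reservation for longer.
    """
    reserved = sum(req * runtime for req, _, runtime in jobs)
    used = sum(avg * runtime for _, avg, runtime in jobs)
    return used / reserved if reserved > 0 else 0.0

print(job_metrics(requested_mb=8000, peak_mb=6000, avg_mb=3000))
print(weighted_efficiency([(8000, 4000, 3600), (2000, 2000, 1200)]))
```

Weighting by runtime matters because a one-hour job that uses half of its request drags efficiency down more than a twenty-minute job that uses all of it can pull it back up.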
In the current climate of persistently high hardware prices, “intensive cultivation” is no longer just a nice-to-have but an inevitable choice for the sustainable development of HPC. The combined solution developed by Shanghai Kaiyun and IBM integrates AI’s intelligent prediction, fine-grained scheduling control, strict risk boundaries, and transparent monitoring into a complete resource optimization loop. Through technological means, it ensures that every hardware investment by enterprises translates into tangible research output and production efficiency, achieving true “cost reduction and efficiency improvement.”
Yang Jie, Deputy General Manager of Shanghai Kaiyun Digital Technology Co., Ltd., stated: “Against the backdrop of persistently high hardware costs, the memory utilization of HPC clusters directly determines an enterprise’s R&D efficiency and competitiveness. Kaiyun’s ‘Predict, Schedule, Control, Monitor’ solution, built on LSF, from AI prediction to fine-grained scheduling and multi-layered transparent monitoring, truly helps enterprises make the most of every byte of memory. This is not just a technological upgrade but a revolution in computing resource management philosophy.”
He Jinchi, Architect at IBM China Technology Business Unit, stated: “The core advantage of LSF lies not only in its powerful scheduling capabilities but also in its deep integration with cutting-edge technologies like AI, transforming resource prediction from ‘relying on experience’ to ‘relying on data,’ solving users’ most real pain points. Additionally, LSF further optimizes data access and migration efficiency during job execution through intelligent data management mechanisms. LSF also offers a rich set of scheduling strategies to comprehensively ensure efficient cluster operation.”
Xu Weijie, General Manager of Automation Business, IBM Greater China Group, stated: “Currently, computing power has become the core carrying capacity for enterprise digital and intelligent transformation, and the key to improving efficiency lies in fine-grained resource management. Together with Kaiyun, we have built a closed-loop solution based on LSF, covering scheduling, prediction, and monitoring, helping enterprises fully unleash the potential of existing computing power without purchasing additional hardware. In the future, IBM will continue to deepen technological innovation in the HPC field, helping enterprises achieve a win-win of cost reduction, efficiency improvement, and business growth.”
Shanghai Kaiyun Digital Technology Co., Ltd. is a high-tech digital technology innovation enterprise, recognized as a “specialized and new” enterprise and a “little giant” enterprise. The company focuses on two core businesses, “Advanced Information Technology Services” and “Smart Manufacturing Scenario Software Development,” providing its clients with advanced productivity construction, digital transformation, big data, and artificial intelligence technologies. In the “Advanced Information Technology Services” field, Kaiyun offers application, construction, and operation services across numerous technology scenarios, including intelligent computing, AI, big data, cloud computing, and information security. In the “Smart Manufacturing Scenario Software Development” field, Kaiyun creates business value for clients through products such as the Kaiyun ICP Intelligent Computing Platform, CMES Smart Manufacturing Software, and CCLab-WorkFlow Intelligent Workflow Software.
IBM
IBM is a leading global hybrid cloud, artificial intelligence, and enterprise services provider, helping clients in over 175 countries and regions derive business insights from their data, streamline business processes, reduce costs, and gain a competitive edge in their industries. More than 4,000 government and corporate entities in critical infrastructure sectors such as financial services, telecommunications, and healthcare rely on the IBM hybrid cloud platform and Red Hat OpenShift to achieve digital transformation quickly, efficiently, and securely. IBM’s breakthrough innovations in artificial intelligence, quantum computing, industry-specific cloud solutions, and enterprise services provide our clients with open and flexible choices. A long-standing commitment to corporate integrity, transparent governance, social responsibility, an inclusive culture, and service spirit is the cornerstone of IBM’s business development.
