By definition, HPC is always at the cutting edge of computing, driving innovations in processor, system, and software design that eventually find their way into more mainstream computing systems. This, in fact, is one of the key founding premises of The Next Platform.
Today it is machine learning and data analytics that are creating many of the most challenging HPC workloads as we move inexorably towards the exascale era. That means data handling and storage, and the ability to be vendor agnostic when necessary, are becoming increasingly important when building your HPC stack – and when getting the most out of it. So what is the state of the art, and how is it likely to change in the near term?
HPC was born as a response to the need for specialized computers for simulation and modeling, their power measured in floating point operations per second. Before founding Cray Research in 1972 and introducing the Cray-1 supercomputer in 1976, Seymour Cray helped develop the CDC 6600, a 1964 system capable of three megaflops that reigned as the world’s fastest computer for five years. Such machines were only ever sold in small numbers – the CDC 6600 attracted around 100 customers, largely very big business, the US government, and the military.
Over the decades, the scale and adoption of HPC grew: processors gained more and more cores, high-performance interconnects were developed, and machines were clustered together. Co-design arrived too, with HPC makers working alongside partners and application scientists on areas such as high-performance storage.
Today’s HPC systems are routinely capable of tens or even hundreds of petaflops of performance, and the market is worth more than $11.2 billion, according to Hyperion Research, spanning verticals such as bio and geo sciences, chemical engineering, financial services, academia, and weather services. What is driving HPC’s growth today, versus half a century ago, is big data – specifically analytics, with machine learning emerging as a formative market.
Tapping into this trend and taking advantage of HPC at scale means creating the right stack – the foundation of which is the right storage architecture, according to Cray’s director of worldwide systems engineering, Rex Tanakit. Get it wrong, and your massive investment in the gargantuan power of compute will count for nothing.
“As we move towards exascale, computing will generate a lot of data. The applications are going to read a lot of data and process a lot of data. And computation and processing are going to generate more data,” he says. “Making sure your investment in HPC is properly used is a requirement at all levels, and that means efficient, fast storage to feed data to compute as quickly as possible.”
“This fact applies to all workloads, both typical and atypical – at a national laboratory level or across the many different industries that are using HPC for diverse workloads and many different data types. It is difficult to have a single storage design that works for everyone,” he says.
Not All HPC Is Created Equal
HPC spans organizations of all sizes, and workloads are just as varied – which means a plethora of HPC environments. Running millions of simulations in parallel for pharmaceutical discovery or weather prediction, with multiple teams sharing resources, often calls for high-speed main memory to hold large amounts of application data. When it comes to science at the national-lab level, systems must work with custom applications, and engineers must tweak and optimize their HPC systems for the code.
Artificial Intelligence – with its machine learning and neural networks – is a fast-growing market that encapsulates many of these use cases. It wraps in big data, analytics, speed of processing, and the need to build and tweak systems for new or custom code. AI is coming to HPC thanks to breakthroughs in the software, with code accessible to more developers than before and capable of running on HPC’s compute and storage.
“Getting the data and data architectures right here is critical. AI is a massive consumer of data – data served at speed and scale. The scale element is based on what is a reasonably sized dataset for your AI model to start learning,” continues Tanakit. “The ability of the system to provide intelligence is dependent upon the data’s quantity – more is better – its granularity – greater segmentation is preferred – and its quality – taken from reputable sources.”
Cray believes algorithms improve over time, as long as new data is fed into the model. This requires a large reserve of data storage and processing power. It is wise, therefore, to select systems that can balance performance, scalability, and availability.
According to Tanakit, 500 GB has become a reasonably sized dataset – but you will probably have many of these. “Nowadays, a petabyte or half a petabyte is common for a typical AI workload. But often the question that firms can’t answer is: ‘What does your dataset look like?’” he says.
To answer that, you need the right tools to run and collect data. “It is important to do the analysis so we can say, this is what your storage workload looks like. Then we can do the configuration and match the right technology to the correct workload,” Tanakit says.
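What such an analysis might look like can be sketched quickly. The short Python example below builds an I/O-size profile from a hypothetical trace of request sizes – the trace format, the file name, and the 64 KiB “small I/O” cutoff are assumptions for illustration, not a Cray tool or format.

```python
# Sketch: a first answer to "what does your storage workload look like?"
# The trace format (one request size in bytes per line) and the 64 KiB
# small-I/O cutoff are illustrative assumptions, not a Cray tool or format.
from collections import Counter


def summarize_io_trace(path, small_io_cutoff=64 * 1024):
    """Bucket request sizes into powers of two and report the small-I/O share."""
    buckets = Counter()
    total = small = 0
    with open(path) as trace:
        for line in trace:
            size = int(line.strip())
            bucket = 1 << max(size - 1, 0).bit_length()  # round up to a power of two
            buckets[bucket] += 1
            total += 1
            if size <= small_io_cutoff:
                small += 1
    for bucket in sorted(buckets):
        print(f"<= {bucket:>12,d} B : {buckets[bucket]:>8,d} requests")
    if total:
        print(f"small I/O (<= {small_io_cutoff:,d} B): {100 * small / total:.1f}% of requests")


if __name__ == "__main__":
    summarize_io_trace("io_sizes.log")  # hypothetical trace file
```

A profile like this – the split between small random requests and large streaming ones – is what lets a storage configuration be matched to the workload rather than guessed at.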
AI is just one example – plenty more applications in the growing “data-intensive” category are coming to these big systems. Others include genomics, computational chemistry, protein modelling, advanced weather simulation and forecasting, and oil and gas seismic processing. Their common characteristics are that they are compute intensive, demand high levels of network performance, and need huge amounts of memory.
That brings us back to data.
Balancing Capacity With Performance
What is the best type of optimized storage platform for these different workloads and workflows? The same system will likely have to cope with a mixture of large files, streaming data, and small files. Such mixed workloads call for a hybrid hardware setup comprising traditional SAS disk drives, which are good for streaming, and SSDs for high IOPS.
That means building a storage architecture with the correct ratio of disk to flash. Very few organizations can afford to go all-flash to manage petabytes of data – Uber has done it, as have others – and flash is not always a suitable medium for long-term storage and retrieval. In industrial use cases, that still leaves the challenge of identifying the data and optimizing its movement between disk and SSDs.
Enter policy-driven storage, as offered through Cray’s ClusterStor L300N storage system with its NXD accelerator, which identifies and directs small data blocks to SSD and large data streams to disk. The L300N manages mixed I/O by combining a new hardware configuration with software that selectively accelerates performance automatically, without a separate storage tier. The software also provides read persistence, write back, I/O histograms, performance statistics, and dynamic flush. The result is that the L300N with the NXD flash accelerator handles small-file I/O and large sequential I/O for parallel file systems in a seamless manner.
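The principle behind this kind of flash acceleration can be sketched in a few lines: inspect each request and steer small blocks to flash while letting large streams flow to disk. The Python below is a conceptual model only – the 32 KiB threshold and the backend interface are assumptions, not how NXD is actually implemented.

```python
# Conceptual sketch of policy-driven I/O steering: small blocks go to an SSD
# pool, large sequential streams go to disk. The threshold and the backend
# interface are illustrative assumptions, not Cray's NXD implementation.
from dataclasses import dataclass
from typing import Protocol


class Backend(Protocol):
    def submit(self, offset: int, size: int, is_write: bool) -> None: ...


@dataclass
class HybridRouter:
    ssd: Backend
    disk: Backend
    small_io_threshold: int = 32 * 1024  # assumed cutoff between "small" and "streaming" I/O

    def submit(self, offset: int, size: int, is_write: bool) -> None:
        # Small blocks benefit from flash IOPS; large streams are well served by
        # disk bandwidth, keeping the more expensive flash free for IOPS work.
        target = self.ssd if size <= self.small_io_threshold else self.disk
        target.submit(offset, size, is_write)
```

The point of doing this below the file system is that neither the user nor the application has to know which medium a given block landed on.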
Of course, there’s flash and – increasingly – industry talk of all-flash-based HPC storage systems. Flash is poised to become an essential technology in HPC storage, with sales of flash growing across the industry.
All-flash systems have a lot of appeal from a performance point of view – flash is around 15X faster than disk, so it easily beats disk on price per IOPS and on throughput. Flash, however, is expensive as capacity storage – roughly 5X the cost of disk. Given that, plus the footprint of disk and the nature of mixed I/O workloads, disk drives will be a reality for some time, making storage in HPC a hybrid play.
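A back-of-the-envelope calculation shows why the economics push towards hybrid. The sketch below uses only the ratios quoted above (roughly 15X the performance, roughly 5X the capacity cost); the absolute disk price, IOPS, and capacity figures are placeholder assumptions, not vendor numbers.

```python
# Back-of-the-envelope cost comparison using the ratios above: flash at ~15X
# the performance of disk but ~5X the cost per unit of capacity. The absolute
# disk figures are placeholder assumptions, not vendor pricing.
DISK_COST_PER_TB = 25.0       # assumed $/TB for nearline disk
DISK_IOPS_PER_DRIVE = 200     # assumed random IOPS for one disk drive
DRIVE_CAPACITY_TB = 10        # assumed capacity per device (same for both, for simplicity)

flash_cost_per_tb = 5 * DISK_COST_PER_TB          # ~5X disk on capacity cost
flash_iops_per_drive = 15 * DISK_IOPS_PER_DRIVE   # ~15X disk on performance

disk_cost_per_iops = DISK_COST_PER_TB * DRIVE_CAPACITY_TB / DISK_IOPS_PER_DRIVE
flash_cost_per_iops = flash_cost_per_tb * DRIVE_CAPACITY_TB / flash_iops_per_drive

print(f"cost per TB   : disk ${DISK_COST_PER_TB:6.2f}   flash ${flash_cost_per_tb:6.2f}")
print(f"cost per IOPS : disk ${disk_cost_per_iops:6.2f}   flash ${flash_cost_per_iops:6.2f}")
# With these assumed figures flash works out 3X cheaper per IOPS but 5X dearer
# per TB - exactly the tension a hybrid disk/flash architecture resolves.
```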
A flash-based storage tier between the compute cluster and a slower disk-based storage tier can provide a faster storage resource to the supercomputer. But how to straddle this mixed world?
Employing flash and disk together should not require separate storage tiers that force data movement, should not require users to re-write or re-compile applications, and should not depend on complicated policy engines to handle the workflows.
It should, in short, be transparent to the user, the application, and the file system of choice.
In addition to the ClusterStor L300N storage system with NXD, Cray recently introduced the ClusterStor L300F storage system, an all-flash, two-rack-unit, 24-SSD enclosure designed to create a hybrid flash/disk system whose flash acceleration directs I/O to the appropriate storage medium. The L300F simplifies storage management because it lets admins create flash pools within their existing Lustre-based file systems – using their existing tools and skills.
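For a flavour of what “flash pools with existing tools” means in practice, Lustre’s standard OST pool mechanism can be driven with the stock lctl and lfs commands. The sketch below simply wraps those commands in Python; the file system name (“scratch”), the OST indices assumed to be flash-backed, and the mount point are hypothetical.

```python
# Sketch: creating a flash OST pool in an existing Lustre file system and
# steering one directory's new files onto it with standard lctl/lfs commands.
# The file system name, OST indices, and paths below are hypothetical.
import subprocess


def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)


# On the MGS node: define the pool and add the flash-backed OSTs to it.
run(["lctl", "pool_new", "scratch.flash"])
run(["lctl", "pool_add", "scratch.flash", "scratch-OST[0000-0003]"])

# On a client: new files created under this directory land on the flash pool.
run(["lfs", "setstripe", "-p", "flash", "/mnt/scratch/ai_training"])
```

Because this is plain Lustre functionality, admins keep the tools and skills they already have rather than learning a separate flash tier.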
Lustre And OpenSFS
Today, the ClusterStor storage system is exclusively based on the Lustre parallel file system, the de facto standard in leadership-class supercomputing environments: 77 percent of the Top 100 systems now use Lustre, according to the June 2018 Top 500 supercomputer rankings.
As a leader in open systems and parallel file systems, and co-founder and benefactor of OpenSFS, Cray builds on community-driven Lustre to unlock the performance of Linux clusters and supercomputers using Cray’s proven HPC storage system architectures.
Cray realized early on that proprietary file systems would struggle to meet future storage requirements, both architecturally and economically, and aligned its storage strategy around Lustre accordingly. As a leading participant in the Lustre User Group, Cray engineers are in touch with all the players and regularly contribute expertise and code to the community.
In late 2017, Cray acquired its own Lustre development and support team through the strategic transaction with Seagate, adding to a number of Lustre development and support engineers already at Cray. Of the 18 organizations contributing to the newest version of Lustre (2.11), Cray was the number three contributor both in the number of commits and lines of code changed.
Among Cray’s contributions: adding enterprise reliability to an already robust file system – including adaptive timeouts, pools, and additional ease-of-use features – and raising the bar on Lustre for large production sites.
Naturally, Lustre has been optimized for disk, but what about flash? Cray is optimizing Lustre for flash, including server-side configuration, setup, and IOPS performance tuning. In doing so, Cray is reducing latency in the software itself – latency that was previously hidden behind disk technology. Cray has also said it will explore use cases for flash under Lustre via SAS, NVM-Express, and NVM-Express over Fabrics. As those optimizations are made, Cray is contributing them all back to the community.
Not Standing Still
It is clear that, for HPC applications, only a mixed storage architecture and open systems model can create a viable roadmap to true exascale computing – something that protects existing investments while providing a scalable migration path.
Concludes Tanakit: “Today everyone talks about the compute part of HPC and the move towards exascale, but storage is becoming ever more important in that space. Don’t buy large-scale compute unless you balance with a proper storage strategy built around efficient data movement across SSD and disk.”
Doing so means you avoid creating bottlenecks, optimize workload efficiency, and maximize return on investment. Not doing so could leave you burning power and underutilizing huge amounts of CPU while standing still.