By Ghassan Azzi, Sales Director, Africa, at Western Digital
Artificial intelligence (AI) has revolutionized the world around us, and its transformative impact stems from its ability to analyze vast amounts of data, learn from it and offer insights and automation capabilities. This data is often spread across data warehouses, data lakes, the cloud and on-premises data centers – and it must remain accessible for analysis if today’s AI initiatives are to succeed.
One of the effects of AI’s proliferation is the disruption of traditional business models.
Organizations are increasingly relying on AI to enhance customer experiences, streamline operations and drive innovation. To maximize the benefits of AI, it’s crucial to adopt advanced storage architectures. NVMe over Fabrics (NVMe-oF™) provides low-latency, high-throughput access needed for AI workloads, accelerating performance and reducing potential bottlenecks.
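As a concrete illustration of how NVMe-oF storage is attached in practice, the Linux `nvme-cli` tool can discover and connect to fabric targets. The address, port and subsystem NQN below are illustrative placeholders, not values from any real deployment:

```shell
# Discover NVMe-oF subsystems exposed by a fabric target over RDMA
# (the IP address and port are made-up examples).
sudo nvme discover -t rdma -a 192.168.10.8 -s 4420

# Connect to a discovered subsystem; the NQN is a hypothetical example.
sudo nvme connect -t rdma -a 192.168.10.8 -s 4420 \
  -n nqn.2024-01.com.example:ai-datasets
```

Once connected, the remote namespace appears as a local block device (e.g. `/dev/nvme1n1`) and can be mounted or handed to an AI data pipeline like any locally attached NVMe drive.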
Implementing disaggregated storage provides greater flexibility and allows storage and compute to scale independently, maximizing resource utilization. Businesses that fail to implement the most suitable architecture and integrate AI into their models risk falling behind in an increasingly data-driven world.
Considerations in Deploying Machine Learning Models
Organizations are under constant pressure to derive as much value from their data as quickly as possible – yet they must do so in a cost-efficient manner that doesn’t inhibit regular business operations. As a result, relying on commodity storage on premises or in the cloud is no longer ideal.
Organizations need to build high-performance, flexible and scalable compute environments that support the real-time processing needs of today’s AI workflows. Efficient, purpose-built data storage is crucial in these use cases, and organizations should account for data volume, velocity, variety and veracity.
Organizations are now able to build public cloud-like infrastructures in on-premises data centers that give them the flexibility and scalability of the cloud with the control and cost efficiency of private infrastructure.
Architected correctly, these environments can provide more bang for the buck – a far more efficient way of supporting the high-performance, highly scalable storage requirements of AI applications. In fact, repatriating AI/ML datasets from the cloud to on-premises data centers may be an ideal option for organizations operating within certain performance or cost limits.
Building an On-Premises Storage Environment for AI Applications
Organizations can build powerful storage environments that have the flexibility and scale of the public cloud, but the manageability and consistency of private infrastructure. Here are three things to consider when building on-premises storage environments ideally suited to the needs of today’s AI/ML-powered world:
1. Server Selection
AI applications require significant compute resources to process and analyze ML data sets quickly and efficiently, making the selection of a suitable server architecture absolutely critical. Most important, however, is the ability to scale GPU resources without creating a bottleneck in the system.
2. High-Performance Storage Networking
It’s also important to include high-performance storage networking that can not only meet (and exceed) the ever-increasing performance demands of GPUs, but also provide scalable capacity and throughput to match learning-model data set sizes. Storage solutions that take advantage of direct-path technology enable direct GPU-to-storage communication, bypassing the CPU to increase data transfer speeds, reduce latency and improve utilization.
3. Based on Open Standards
Finally, solutions should be hardware and protocol agnostic, providing multiple ways to connect servers and storage to the network. The interoperability of your infrastructure will go a long way toward building a flexible environment primed for AI applications.
Building a New Architecture
Building public cloud-like infrastructures on-premises may provide a solid option – giving organizations the flexibility and scalability of the cloud with the control and cost efficiency of private infrastructure. However, it’s important that the right storage architecture decisions are being made with AI considerations in mind – providing the right combination of compute power and storage capacity that AI applications need to move at the speed of business.
One way to ensure proper resource allocation and reduce bottlenecks is through storage disaggregation. Independently scaling storage makes it possible to keep GPUs saturated, which can otherwise be challenging in many AI/ML workloads running on hyperconverged solutions. This means that storage can be efficiently scaled without compromising performance.
The combination of Western Digital’s RapidFlex™ technology, Ingrasys’ ES2100 with integrated NVIDIA Spectrum™ Ethernet switches, and NVIDIA’s GPUs, Magnum IO GPUDirect Storage, and ConnectX® SmartNICs provides the performance, scalability and agnostic architecture that organizations need for building on-premises supercomputing environments for AI/ML applications.
Using all three together allows enterprises to create a direct data path between NVMe-oF storage and GPU memory to drive high performance and efficient utilization of storage and GPU resources. Western Digital has created a proof of concept demonstrating simple independent scaling of storage bandwidth to maximize GPU workloads, ranging from greater than 25 GB/s for a single NVIDIA A100 Tensor Core GPU to over 100 GB/s for four NVIDIA A100 GPUs.
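The proof-of-concept figures above imply a simple sizing rule: the aggregate storage bandwidth a disaggregated fabric must deliver scales roughly linearly with GPU count. A minimal sketch, using the approximately 25 GB/s per-A100 figure from the proof of concept as an assumed baseline (real workloads will vary):

```python
def required_storage_bandwidth(num_gpus: int, gbps_per_gpu: float = 25.0) -> float:
    """Estimate aggregate storage bandwidth (GB/s) needed to keep GPUs saturated.

    The 25 GB/s per-GPU default is taken from the A100 proof-of-concept
    figures cited above; it is an illustrative assumption, not a spec.
    """
    return num_gpus * gbps_per_gpu

# One A100 needs roughly 25 GB/s; four need roughly 100 GB/s,
# consistent with the proof-of-concept scaling described above.
print(required_storage_bandwidth(1))  # 25.0
print(required_storage_bandwidth(4))  # 100.0
```

Because storage is disaggregated, hitting the larger number is a matter of adding fabric-attached storage bandwidth rather than re-provisioning entire converged nodes.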