Breaking through the CUDA encirclement: another move

The impact of CUDA on the AI industry needs no further elaboration; we have covered it in many previous Semiconductor Industry Observation articles.

Foreign outlet HPCwire has also put it bluntly: the pairing of GenAI with Nvidia GPUs is no accident. Nvidia has always recognized that tools and applications are what grow its market, so it has created very low barriers to entry with software tools such as CUDA and libraries such as cuDNN that are optimized for Nvidia hardware.

Indeed, Nvidia is known as a hardware company. But as Bryan Catanzaro, Vice President of Applied Deep Learning Research at Nvidia, said, "Many people don't know this, but Nvidia has more software engineers than hardware engineers." Nvidia has built a strong software "moat" around its hardware.

Although CUDA is not open source, it is provided free of charge and remains firmly under Nvidia's control. While this benefits Nvidia (justifiably so, given the time and money it has invested in CUDA), it creates difficulties for companies and users hoping to capture part of the HPC and GenAI market with alternative hardware.

Now, however, a new effort is ready to take its shot.

SCALE, emerging out of the blue

As Phoronix notes, there have been various attempts to breach the CUDA moat: HIPIFY, which converts CUDA source code into portable HIP C++ that can target AMD GPUs, and later ZLUDA, once funded by AMD, which lets CUDA binaries run on AMD GPUs by acting as a drop-in replacement for the CUDA libraries.

Now there is a new contender: SCALE, a GPGPU toolchain that has just been publicly released and that allows CUDA programs to run natively on AMD graphics processors.

According to its introduction, SCALE was developed by the British company Spectral Compute over roughly seven years. It is a "clean room" implementation of CUDA that builds on some open-source LLVM components to form a solution that natively compiles unmodified CUDA source code for AMD GPUs.

This is a major advantage over projects that merely assist with code conversion by translating to another "portable" language, or that require additional manual steps from developers. SCALE consumes CUDA programs as-is, including programs that rely on NVPTX inline assembly. The SCALE compiler is a drop-in alternative to NVIDIA's nvcc, and the toolchain can "impersonate" an installation of the NVIDIA CUDA Toolkit.
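
To make the "drop-in" idea concrete, here is a minimal sketch: an ordinary CUDA program that would stay byte-for-byte unchanged when built with an nvcc-compatible compiler such as SCALE's. The build command in the comment is a hypothetical placeholder, not documented SCALE syntax; the point is simply that neither the source nor the nvcc-style invocation needs to change.

```cuda
// vector_add.cu -- plain CUDA, nothing vendor-specific beyond the CUDA API itself.
// Hypothetical build step with an nvcc-compatible compiler (placeholder command):
//   scale-nvcc vector_add.cu -o vector_add
#include <cstdio>
#include <cuda_runtime.h>

__global__ void vector_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // one thread per element
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(float);

    // Host data.
    float* ha = new float[n];
    float* hb = new float[n];
    float* hc = new float[n];
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // Device buffers and copies via the standard CUDA runtime API.
    float *da, *db, *dc;
    cudaMalloc(&da, bytes);
    cudaMalloc(&db, bytes);
    cudaMalloc(&dc, bytes);
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    vector_add<<<(n + 255) / 256, 256>>>(da, db, dc, n);
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

    printf("c[0] = %f\n", hc[0]);  // expect 3.0
    cudaFree(da); cudaFree(db); cudaFree(dc);
    delete[] ha; delete[] hb; delete[] hc;
    return 0;
}
```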

SCALE has successfully passed testing with software such as Blender, llama.cpp, XGBoost, FAISS, GOMC, stdgpu, hashcat, and even NVIDIA Thrust. Spectral Compute has been testing SCALE on RDNA2 and RDNA3 GPUs, with basic testing on RDNA1; Vega support is still in progress.

Essentially, SCALE comprises an nvcc-compatible compiler that builds CUDA code for AMD GPUs, an implementation of the CUDA runtime and driver APIs for AMD GPUs, and open-source wrapper libraries that in turn call into AMD's ROCm libraries.

Unlike ZLUDA, which was quietly funded by AMD, Spectral Compute says it has funded this development through its consulting business since 2017. The most obvious drawback of SCALE is that it is not itself open-source software, though at least a free license is available to users.

According to the official documentation, SCALE is a GPGPU programming toolkit that allows CUDA applications to be natively compiled for AMD GPUs. SCALE does not require modifications to the CUDA program or its build system, and support for more GPU vendors and CUDA APIs is in development.

In terms of composition, SCALE includes:

1. An nvcc-compatible compiler capable of compiling nvcc-dialect CUDA for AMD GPUs, including PTX asm.

2. The implementation of the CUDA runtime and driver API for AMD GPUs.

3. Open-source wrapper libraries that provide the "CUDA-X" APIs by delegating to the corresponding ROCm libraries. This is how libraries such as cuBLAS and cuSOLVER are handled (see the sketch after this list).
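
As an illustration of that third, wrapper-library layer, consider a host program that calls the standard cuBLAS API. Per the description above, such a call would be satisfied by an open-source wrapper that delegates to the corresponding ROCm library (presumably rocBLAS in this case); the program itself remains ordinary CUDA/cuBLAS code. This is only a sketch of the idea, not SCALE's actual wrapper implementation.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main() {
    const int n = 4;  // small n x n matrices for illustration
    std::vector<float> a(n * n, 1.0f), b(n * n, 2.0f), c(n * n, 0.0f);

    float *da, *db, *dc;
    cudaMalloc(&da, n * n * sizeof(float));
    cudaMalloc(&db, n * n * sizeof(float));
    cudaMalloc(&dc, n * n * sizeof(float));
    cudaMemcpy(da, a.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, b.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    // Standard cuBLAS calls; under SCALE these would be served by a wrapper
    // library that forwards to the corresponding ROCm library.
    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, da, n, db, n, &beta, dc, n);
    cublasDestroy(handle);

    cudaMemcpy(c.data(), dc, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    printf("c[0] = %f\n", c[0]);  // expect 8.0 for 4x4 all-ones times all-twos
    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}
```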

Unlike other solutions, SCALE does not offer a new way to write GPGPU software; instead, it allows programs written in the widely used CUDA language to be compiled directly for AMD GPUs. SCALE aims for full compatibility with NVIDIA CUDA, as its developers believe users should not have to maintain multiple codebases or sacrifice performance to support multiple GPU vendors. Finally, the developers note that SCALE's language is a superset of NVIDIA CUDA, offering opt-in language extensions that can make GPU code easier and more efficient to write for users who wish to move away from nvcc.

In summary, compared to other cross-platform GPGPU solutions, SCALE has several key innovations:

1. SCALE accepts CUDA programs as they are, without the need to port them to other languages. This is even the case for programs that use inline PTX asm (an example follows this list).

2. The SCALE compiler accepts the same command-line options and CUDA dialect as nvcc, serving as a drop-in replacement.

3. It "impersonates" an installation of the NVIDIA CUDA Toolkit, so existing build tools and scripts, including cmake-based ones, work as normal.
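
Item 1's mention of inline PTX deserves a concrete example. The kernel below uses a small piece of inline PTX (reading the warp lane ID), exactly the kind of construct that normally ties code to nvcc and NVIDIA hardware and defeats source-level "portability" rewrites; SCALE's claim is that such code is accepted as-is. The example itself is just ordinary CUDA with inline asm, not anything SCALE-specific.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Returns the calling thread's lane index within its warp via inline PTX.
// Inline PTX like this is what usually blocks CUDA-to-"portable" conversions.
__device__ unsigned int lane_id() {
    unsigned int lane;
    asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
    return lane;
}

__global__ void print_lanes() {
    if (threadIdx.x < 4) {
        printf("thread %d -> lane %u\n", threadIdx.x, lane_id());
    }
}

int main() {
    print_lanes<<<1, 32>>>();
    cudaDeviceSynchronize();  // flush device-side printf output
    return 0;
}
```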

In terms of hardware, the following GPU targets are currently supported and tested:

- AMD GFX1030 (Navi 21, RDNA 2.0)

- AMD GFX1100 (Navi 31, RDNA 3.0)

The following GPU targets have received ad-hoc manual testing and "seem to work":

- AMD GFX1010

- AMD GFX1101

Spectral Compute is working on support for AMD gfx900 (Vega 10, GCN 5.0), and other GPU targets may follow.

As mentioned above, support for more GPUs is planned.

Breaking through CUDA: AMD's and Intel's approaches

As another major player in the GPU market, AMD is also crossing the CUDA moat through various means.

In HPCwire's view, replacing Nvidia hardware means that GPUs and accelerators from other vendors must be able to run the many models and tools written for CUDA. AMD addresses this with HIP, a C++ runtime API and kernel language that lets developers create portable applications for AMD and NVIDIA GPUs from a single source base. It should be emphasized that HIP is not CUDA: it is native to AMD ROCm, AMD's counterpart to Nvidia's CUDA stack.

AMD also provides the open-source HIPIFY conversion tool. HIPIFY can take CUDA source code and convert it to AMD HIP, which can then run on AMD GPU hardware. Naturally, this is also part of its ROCm stack.
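
As a rough illustration of what HIPIFY does, the snippet below is plain CUDA, with comments indicating the HIP equivalents that the hipify tools substitute: cuda* runtime calls become hip* calls, and the CUDA header is swapped for the HIP runtime header, while the kernel body itself typically needs no changes. Treat the mappings as illustrative, not an exhaustive description of the tool.

```cuda
// Plain CUDA; comments show the substitutions hipify typically produces.
#include <cstdio>
#include <cuda_runtime.h>      // -> #include <hip/hip_runtime.h>

__global__ void scale_by(float* data, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unchanged: same built-ins exist in HIP
    if (i < n) data[i] *= factor;
}

void run(float* host, int n) {
    float* dev;
    size_t bytes = n * sizeof(float);
    cudaMalloc(&dev, bytes);                               // -> hipMalloc
    cudaMemcpy(dev, host, bytes, cudaMemcpyHostToDevice);  // -> hipMemcpy / hipMemcpyHostToDevice
    scale_by<<<(n + 255) / 256, 256>>>(dev, 2.0f, n);      // triple-chevron launches are also valid HIP
    cudaMemcpy(host, dev, bytes, cudaMemcpyDeviceToHost);  // -> hipMemcpy / hipMemcpyDeviceToHost
    cudaFree(dev);                                         // -> hipFree
}

int main() {
    float host[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    run(host, 8);
    printf("host[0] = %f\n", host[0]);  // expect 2.0
    return 0;
}
```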

AMD also worked with a third-party developer on the ZLUDA project, which lets AMD GPUs run Nvidia CUDA applications. ZLUDA runs unmodified CUDA binaries on AMD GPUs at close to native performance. It is considered alpha quality but has been confirmed to work with various native CUDA applications (such as LAMMPS, NAMD, and OpenFOAM). Until recently, AMD quietly funded ZLUDA, but that sponsorship has ended; the project continues, as its codebase has since been released publicly.

On the Intel side, they have also made many attempts.

In a September 2023 speech, Intel Chief Technology Officer Greg Lavender suggested using a large language model (LLM) to convert CUDA code into something that can run on other AI accelerators, such as Intel's own Gaudi2 or GPU Max hardware. "I challenge all developers to use LLM and Copilot technologies to train machine learning models and convert all CUDA code to SYCL," Lavender said.

SYCL provides a consistent programming language across CPUs, GPUs, FPGAs, and AI accelerators within a heterogeneous framework, in which each architecture can be programmed and used either on its own or in combination. The language and API extensions in SYCL support a range of development use cases, including building new offload-acceleration or heterogeneous-compute applications, converting existing C or C++ code to SYCL-compatible code, and migrating from other accelerator languages or frameworks.

Specifically, SYCL is a royalty-free, cross-architecture abstraction layer that underpins Intel's data-parallel C++ programming language, and SYCLomatic is the open-source tool that migrates CUDA code to it.

In short, the SYCL tooling handles most of the heavy lifting (reportedly up to 95%) of porting CUDA code to a form that can run on non-Nvidia accelerators. As you might expect, though, some fine-tuning and adjustment is usually still needed to get an application running at full speed.
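
For readers who have not seen such a migration, the sketch below shows the kind of mapping involved: an ordinary CUDA kernel, with a comment indicating the approximate SYCL shape a tool like SYCLomatic produces. The SYCL fragment in the comment is an approximation for illustration only, not literal tool output.

```cuda
#include <cuda_runtime.h>

// A CUDA kernel of the kind a CUDA-to-SYCL migration tool reworks.
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}
// Roughly, the migrated SYCL version expresses the same loop as a lambda
// submitted to a queue -- approximate shape only, not literal SYCLomatic output:
//   q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
//       y[i] = a * x[i] + y[i];
//   });
// The remaining "fine-tuning" mentioned above then concerns memory management
// and vendor-specific performance features.

int main() {
    const int n = 1024;
    float *x, *y;
    cudaMallocManaged(&x, n * sizeof(float));  // managed memory keeps the host side short
    cudaMallocManaged(&y, n * sizeof(float));
    for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }
    saxpy<<<(n + 255) / 256, 256>>>(3.0f, x, y, n);
    cudaDeviceSynchronize();
    cudaFree(x); cudaFree(y);
    return 0;
}
```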

"If you want to make full use of Intel GPUs (instead of AMD GPUs or Nvidia GPUs), then you must take some measures, either through SYCL's extension mechanism or simply by building the code," explained Joe Curley, Intel's Vice President of Software Products and Ecosystem.

Meanwhile, a group consisting of Intel, Google, Arm, Qualcomm, Samsung, and other technology companies is developing an open-source software suite to keep AI developers from being locked into Nvidia's proprietary technology, allowing their code to run on any machine, with any chip.

This group, the Unified Acceleration (UXL) Foundation, told Reuters that the project's technical details should reach a "mature" state in the second half of this year, though a final release target has not yet been set. The project currently builds on the oneAPI open standard developed by Intel, which aims to eliminate requirements for specific coding languages, code libraries, and other tools, so that developers are not tied to a particular architecture such as Nvidia's CUDA platform. For more details, see the earlier article "Breaking the Dominance of CUDA."

Even this, however, seems not to be enough, and more players are taking action.

More solutions

As is well known, CUDA-enabled applications dominate GPU acceleration in the HPC field. Porting code to GPUs with CUDA can typically yield a 5-6x speedup. (Note: not all code achieves this acceleration, and some code cannot use GPU hardware at all.)

In GenAI, however, the situation is quite different. Initially, TensorFlow was the preferred tool for building GPU-accelerated AI applications; it runs on CPUs and can be accelerated with CUDA on GPUs. That is changing rapidly. The alternative to TensorFlow is PyTorch, an open-source machine learning library for developing and training neural-network-based deep learning models, developed primarily by Facebook's AI research team.

Ryan O'Connor, a developer educator at AssemblyAI, pointed out in a blog post that 92% of the available models on the popular website HuggingFace are exclusive to PyTorch. Users can download and integrate the latest trained and fine-tuned models into their application pipelines with just a few lines of code.

A comparison of machine learning papers shows a clear trend towards using PyTorch and moving away from TensorFlow.

Of course, PyTorch is built on CUDA calls at its core, but that is not mandatory, as PyTorch isolates users from the underlying GPU architecture. There is also a build of PyTorch that uses AMD ROCm, the open-source software stack for AMD GPU programming. Crossing the CUDA moat with AMD GPUs might be as simple as using PyTorch. The impact of PyTorch on CUDA was discussed in a previous article, "CUDA is being dethroned."

What are everyone's thoughts on these options?
