Microsoft is synonymous with cloud computing, with its Azure platform used by companies around the world. Currently, the company uses AMD data center GPUs in its Linux servers. However, when a GPU needs to be replaced or a new one installed, the server has to be shut down so the graphics card can be changed.
Microsoft engineers contribute a driver patch for GPU disaggregation technology, intended for AMD graphics cards
To avoid that downtime, Microsoft has created a driver patch that enables “hot-plugging” of AMD GPUs on its Linux servers. Hot-plugging means removing a graphics card from its PCIe slot, or inserting one, while the system stays running.
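For readers unfamiliar with how Linux exposes device removal to software, the sketch below uses the kernel's generic PCI sysfs interface: writing "1" to a device's remove attribute detaches it, and writing "1" to the bus-level rescan attribute asks the kernel to rediscover devices. This is not Microsoft's patch. It only illustrates the mechanism a GPU driver has to survive, and the device address in the usage note is a hypothetical placeholder. Running it against the real sysfs tree requires root privileges.

```python
from pathlib import Path

# Standard location of the PCI bus in Linux sysfs.
SYSFS_PCI = Path("/sys/bus/pci")


def remove_path(bdf: str, root: Path = SYSFS_PCI) -> Path:
    """Return the sysfs 'remove' attribute for one PCI device.

    `bdf` is the device address in domain:bus:device.function form.
    """
    return root / "devices" / bdf / "remove"


def rescan_path(root: Path = SYSFS_PCI) -> Path:
    """Return the bus-level 'rescan' attribute."""
    return root / "rescan"


def detach(bdf: str, root: Path = SYSFS_PCI) -> None:
    """Ask the kernel to unbind the driver and remove the device."""
    remove_path(bdf, root).write_text("1")


def rescan(root: Path = SYSFS_PCI) -> None:
    """Ask the kernel to rescan the PCI buses for new devices."""
    rescan_path(root).write_text("1")
```

A caller would run detach("0000:03:00.0") before pulling a card and rescan() after inserting a replacement, where 0000:03:00.0 is an assumed address; the real one can be found with lspci -D. The root parameter exists only so the helpers can be pointed at a scratch directory for testing.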
Shuotao Xu, an engineer at Microsoft Research, posted the request below asking for code review of AMDGPU hot-plug support. The patch targets the Linux kernel and is meant to let Microsoft Azure systems hot-plug GPU-based accelerators as the need arises. Microsoft Research also opened a corresponding pull request on GitHub, which is linked in the request.
Dear AMD Colleagues,
We are from Microsoft Research working on GPU disaggregation technology.
We created a patch against https://gitlab.freedesktop.org/agd5f/linux.git, against amd-staging-drm-next, which will enable hot-plug PCIe support for amdgpu.
We also created a pull request, “Add PCIe hotplug support for amdgpu” by xushuotao (Pull Request #131, RadeonOpenCompute/ROCK-Kernel-Driver, https://github.com/RadeonOpenCompute/ROCK-Kernel-Driver/pull/131), in ROCK-Kernel-Driver, against rocm-5.0.x.
We believe that hot-plugging support for GPU devices can open doors for many advanced data center applications over the next few years, and we’d love to hear reviews on this PR so we can continue the technical discussions around this feature.
Could you please help review this patch?
Thanks very much!
– The code review request for AMDGPU hot plug support
Microsoft has shared little information about its new GPU disaggregation technology. The patch appears intended to let Azure add GPU acceleration to servers that do not yet have a graphics card installed, without taking them offline. Servers are expected to stay up far longer than consumer machines, so hot-plug capability for GPUs would be a very useful tool.
Hot-plugging graphics cards and accelerators via the PCIe slot is a relatively new concept. A limited form of it already exists in some consumer systems, such as eGFX enclosures, which allow an AMD card to be hot-plugged over a Thunderbolt 3 connection. Servers have yet to see this feature for GPUs. With data centers becoming more prevalent in the market, the new technology would benefit Microsoft and its Azure systems, as well as AMD and its GPU lines.