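Repair case: H100-SXM5 server drops a GPU during system-level test (GPU not detected, abnormal clock signal)
Server model: H100-SXM5
Reported component: GPU sub-card not detected
Fault description: GPU not detected; card dropped (not enumerated)
Conclusion: Diagnostics reported the board as not present and the GPU was not detected. Measurement showed an abnormal clock signal, which points to a fault in the clock circuit. The fault was cleared after the clock chip was replaced.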
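The faulty sub-card dropped out during system-level testing. Fault isolation found that the card could not be read: the board ID was unreadable and the GPU driver did not detect the card.
A known-good board returns its board ID:
On the faulty board, the card cannot be read: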
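One host-side way to confirm that a card has dropped is to count the NVIDIA GPU functions visible on the PCIe bus. The sketch below is not the tool used in this case; it assumes the text output of `lspci -n` is available, that GPUs appear with PCI vendor ID 10de and a display class code (0300 or 0302), and that eight GPUs are expected. The function names are illustrative.

```python
import re

NVIDIA_VENDOR_ID = "10de"
# Assumption: GPUs enumerate as VGA (0300) or 3D controller (0302) class devices.
GPU_CLASS_CODES = {"0300", "0302"}

# Matches lines such as "17:00.0 0302: 10de:xxxx (rev a1)" from `lspci -n`.
LINE_RE = re.compile(r"^(\S+)\s+([0-9a-f]{4}):\s+([0-9a-f]{4}):([0-9a-f]{4})")


def find_nvidia_gpus(lspci_n_output):
    """Return (slot, device_id) for every NVIDIA GPU-class device in the text."""
    gpus = []
    for line in lspci_n_output.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        slot, class_code, vendor, device = match.groups()
        if vendor == NVIDIA_VENDOR_ID and class_code in GPU_CLASS_CODES:
            gpus.append((slot, device))
    return gpus


def missing_gpu_count(lspci_n_output, expected=8):
    """Return how many of the expected GPUs are absent from the bus listing."""
    return max(0, expected - len(find_nvidia_gpus(lspci_n_output)))
```

A non-zero result from `missing_gpu_count` only shows that a card is not enumerating; it does not say whether the cause is the connection, power, clock, or reset path.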
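A GPU can go undetected for many reasons, including poor connections, power faults, clock frequency deviation, reset anomalies, false monitoring alarms, software bugs, and driver mismatches. This makes it one of the most complex and tedious faults to isolate during repair.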
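Probing the clock signal with an oscilloscope revealed the anomaly. In theory the clock should be a square wave. The transition region is the most critical part of a clock signal: the edges must rise and fall cleanly, with no hooks or steps, and the rise and fall rates must be fast enough. In the actual measurement, the rising edge was severely non-monotonic, with clearly visible undershoot and ripple, so the clock circuit was judged to be faulty.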
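The monotonicity criterion described above can be expressed as a simple check on sampled edge data. This is a minimal sketch, assuming the rising edge has been exported from the oscilloscope as a list of voltage samples; the 20% to 80% evaluation window, the `tolerance` parameter, and the function name are assumptions for illustration, not values from this repair.

```python
def analyze_rising_edge(samples, v_low, v_high, tolerance=0.0):
    """Check a rising edge for monotonicity between 20% and 80% of the swing.

    Returns (is_monotonic, worst_dip), where worst_dip is the largest drop
    below the running peak inside the evaluation window, in volts.
    """
    swing = v_high - v_low
    lo_thr = v_low + 0.2 * swing
    hi_thr = v_low + 0.8 * swing

    # First sample at or above the lower threshold.
    start = next((i for i, v in enumerate(samples) if v >= lo_thr), None)
    if start is None:
        raise ValueError("edge never reaches the lower threshold")
    # First sample at or after `start` that reaches the upper threshold.
    end = next((i for i in range(start, len(samples))
                if samples[i] >= hi_thr), None)
    if end is None:
        raise ValueError("edge never reaches the upper threshold")

    worst_dip = 0.0
    peak = samples[start]
    for v in samples[start:end + 1]:
        if v < peak:
            # The signal fell back below a level it had already reached.
            worst_dip = max(worst_dip, peak - v)
        else:
            peak = v
    return worst_dip <= tolerance, worst_dip
```

A clean edge yields a dip of zero, while a hook or step in the transition region shows up as a positive `worst_dip`. The acceptable tolerance depends on the receiver's input thresholds and noise margin, which are not given in this case.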
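The sub-card was then baked. (Baking removes moisture from the solder balls to prevent solder-joint cracking, prevents board delamination caused by moisture vaporizing, and prevents moisture damage to components.)
The clock chip was replaced on a rework station. After the replacement, the single-card test passed and the system-level FLD stress test passed.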
MODS-000000000000 | nvlink | nvlink | | NVLink | SXM2_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvlink | nvlink | | NVLink | SXM6_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvlink | nvlink | | NVLink | SXM5_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvlink | nvlink | | NVLink | SXM8_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvlink | nvlink | | NVLink | SXM7_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvlink | nvlink | | NVLink | SXM3_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvlink | nvlink | | NVLink | SXM1_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvlink | nvlink | | NVLink | SXM4_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvswitch | nvswitch | | NVSwitch | SXM2_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvswitch | nvswitch | | NVSwitch | SXM4_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvswitch | nvswitch | | NVSwitch | SXM1_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvswitch | nvswitch | | NVSwitch | SXM3_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvswitch | nvswitch | | NVSwitch | SXM8_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvswitch | nvswitch | | NVSwitch | SXM5_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvswitch | nvswitch | | NVSwitch | SXM6_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | nvswitch | nvswitch | | NVSwitch | SXM7_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | power | power | | | SXM2_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | power | power | | | SXM7_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | power | power | | | SXM6_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | power | power | | | SXM5_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | power | power | | | SXM4_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | power | power | | | SXM8_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | power | power | | | SXM3_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | power | power | | | SXM1_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | gpu_fieldiag | gpu_fieldiag | | GPU | SXM2_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | gpu_fieldiag | gpu_fieldiag | | GPU | SXM4_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | gpu_fieldiag | gpu_fieldiag | | GPU | SXM1_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | gpu_fieldiag | gpu_fieldiag | | GPU | SXM3_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | gpu_fieldiag | gpu_fieldiag | | GPU | SXM8_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | gpu_fieldiag | gpu_fieldiag | | GPU | SXM5_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | gpu_fieldiag | gpu_fieldiag | | GPU | SXM7_SN_xxxxxxxxxxxxx | OK
MODS-000000000000 | gpu_fieldiag | gpu_fieldiag | | GPU | SXM6_SN_xxxxxxxxxxxxx | OK
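A result listing like the one above can be checked mechanically instead of by eye. The following is a rough sketch that assumes each row has seven pipe-separated fields in the order shown (run ID, test, subtest, an empty field, component, unit, status) and that the slot name is the part of the unit field before the first underscore. The field interpretation and the function name are assumptions, not a documented log format.

```python
from collections import defaultdict


def summarize_results(log_text):
    """Group passing slots by test name and collect any non-OK rows."""
    passed = defaultdict(set)
    failed = []
    for line in log_text.splitlines():
        fields = [field.strip() for field in line.split("|")]
        if len(fields) != 7:
            continue  # skip anything that is not a result row
        _run_id, test, _subtest, _unused, _component, unit, status = fields
        slot = unit.split("_", 1)[0]  # e.g. "SXM2" from "SXM2_SN_..."
        if status == "OK":
            passed[test].add(slot)
        else:
            failed.append((test, slot, status))
    return dict(passed), failed
```

For a full pass on an eight-GPU system, one would expect each test to list all eight slots (SXM1 through SXM8) and the `failed` list to be empty.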
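网度通信 has a complete chip-level repair and test environment and professional repair engineers, and can test and repair the full range of NVIDIA GPU products, including RTX 4090, RTX 5090, A100, A800, H100, H200, H800, H20, and B200 compute cards. Services cover difficult faults such as GPU core BGA soldering, diagnosis and replacement of memory chips, power control chips, switch chips, and clock chips, and repair of mainboards and data switch boards.
网度通信 also offers liquid-cooled server GPU module repair, compute card repair, baseboard power supply replacement, and spare-parts maintenance as a one-stop service.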