Intel® Data Center Diagnostic Tool Now Available for Deployment

Published: 07/12/2021

By Arjan van de Ven

Intel Beta Tool Meets Production Release Qualification to Help Customers Manage System Health

A few months back, I introduced the beta version of a tool to help our customers manage their data center systems and meet their increasing expectations for high reliability and availability. Following customer trials and further enhancements to the tool, I am excited to announce the general availability release of the Intel® Data Center Diagnostic Tool. 

In technology products, including semiconductor electronics, faults can be driven by multiple factors, including early-life failures, random defects in manufacturing, test coverage gaps, logic issues, wear-out, and even cosmic radiation. Intel has been working with Hyperscale Cloud Providers¹,² to develop tools to help detect and limit the impact of these faults. With the release of the Intel Data Center Diagnostic Tool, or Intel DCDIAG for short, private and hybrid-cloud data center managers can now benefit from this effort and deploy a tool specifically designed to help maintain their fleet health.

High performance computing (HPC) customers, like many in industry, have historically used public benchmarking tools to help debug system errors. However, because these tools are not specifically designed for fleet maintenance, they are often unable to pinpoint the exact failing component in the system, which leads to flaky systems being checked multiple times before a fault is found.

During our beta trial, we worked with HPC customers who ran the tool on systems in debug and were able to narrow an issue down to a specific faulty CPU in less than an hour. This quick detection saved the customer significant debug time and helped them resolve an intermittent issue that had been plaguing them for weeks.

In this example, the customer ran the full test, which takes about 45 minutes at 100% utilization. Intel understands that each customer workload is unique, so we have created multiple options for running the diagnostic tool to fit your predictive maintenance plan. In addition to a full periodic test, the tool can run trickle testing, which runs tests in the background for one second every hour and reduces the burden of taking systems out of service to run the full test. We anticipate that future enhancements will add more flexibility in when and for how long these trickle tests run.
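To put the two modes side by side, here is a rough back-of-the-envelope sketch in Python. It uses only the figures quoted above (a full test of about 45 minutes, and trickle testing of one second per hour). The function names are hypothetical and are not part of the tool, and the assumption that test time simply adds up across trickle runs is ours, not a statement about how the tool schedules its tests.

```python
# Back-of-the-envelope comparison of the full test and trickle testing.
# Figures taken from the article: full test ~45 minutes, trickle mode
# runs 1 second of tests every hour. All names here are illustrative.

FULL_TEST_SECONDS = 45 * 60      # ~45 minutes at 100% utilization
TRICKLE_SECONDS_PER_HOUR = 1     # one second of testing per hour
SECONDS_PER_HOUR = 3600


def trickle_duty_cycle() -> float:
    """Fraction of wall-clock time spent testing in trickle mode."""
    return TRICKLE_SECONDS_PER_HOUR / SECONDS_PER_HOUR


def hours_to_match_full_test() -> float:
    """Hours of trickle mode needed to accumulate one full test's runtime.

    Assumption: test time is simply additive across trickle runs, which
    may not reflect how the tool actually selects or schedules tests.
    """
    return FULL_TEST_SECONDS / TRICKLE_SECONDS_PER_HOUR


if __name__ == "__main__":
    print(f"Trickle duty cycle: {trickle_duty_cycle():.4%}")
    hours = hours_to_match_full_test()
    print(f"Hours to accumulate a full test's runtime: {hours:.0f} "
          f"(about {hours / 24:.1f} days)")
```

Under these assumptions, trickle testing occupies roughly 0.03% of each hour, and it would take about 2,700 hours of trickle mode to accumulate the runtime of one full test. The two modes are therefore complementary rather than interchangeable.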

To find out more about system requirements and to access the Intel Data Center Diagnostic Tool, please visit the Support page.

Product and Performance Information

¹ “Silent Data Corruptions at Scale” - Facebook https://research.fb.com/publications/silent-data-corruptions-at-scale/

Performance varies by use, configuration and other factors. Learn more at www.Intel.com/PerformanceIndex.