Daniel Andriesse, INT31
Neer Roggel, INT31
If you’re interested in computer security, it’s likely that you have heard about glitching attacks such as Plundervolt, which use momentary undervolting to corrupt computations in ways that break security guarantees. In response to these threats, software undervolting interfaces are increasingly being locked down, making undervolting-based glitching more difficult.
In this article, we want to discuss a potential way of glitching that could cause Plundervolt-like effects—denial of service, transient computational errors, silent data corruption, and more—without relying on any direct or indirect control over voltage at all: Software-Based Thermal Glitching (SBTG). Can SBTG produce these effects in practice? We’ll spoil the answer: we don’t know yet. The process of finding out has proven tricky, but exciting. That’s why we want to share what we’ve learned so far.
Undervolting and Glitching
Given the ever-increasing transistor and power density in computer processors, chip designers face significant challenges in balancing power efficiency, performance, and thermal limits. To accommodate the varying demands on the system, modern CPUs do not operate at a fixed voltage and frequency, but dynamically adjust these parameters at runtime, factoring in the performance requirements of the workload as well as physical influences such as the effect of temperature on the electrical properties of the materials the CPU is made of.
These dynamic adjustments are handled by dedicated circuitry and firmware, typically integrated on the CPU, in conjunction with power delivery and voltage regulation circuits on the motherboard.1 This is known as Dynamic Voltage and Frequency Scaling (DVFS). While this process happens automatically for the most part, CPUs also expose software interfaces that allow the end user to adjust the voltage and frequency to their needs, either in BIOS or at runtime. An example of such an interface on x86 is MSR 0x150, otherwise known as the overclocking mailbox (OC mailbox).
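To make this concrete, the following is a minimal sketch (our illustration of a typical setup, not code from any particular tool) of how software on Linux can read an MSR such as the OC mailbox through the msr kernel module. Writing commands to the mailbox requires a model-specific bit encoding that we deliberately omit here.

```c
// Minimal sketch: reading an MSR such as the OC mailbox (0x150) on Linux.
// Assumes the msr kernel module is loaded and the process runs as root.
// The bit-level command encoding of the mailbox itself is model-specific
// and intentionally not shown.
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

#define OC_MAILBOX_MSR 0x150

int main(void) {
    int fd = open("/dev/cpu/0/msr", O_RDONLY);
    if (fd < 0) { perror("open /dev/cpu/0/msr"); return 1; }

    uint64_t value = 0;
    // The msr driver maps the MSR address to the file offset.
    if (pread(fd, &value, sizeof(value), OC_MAILBOX_MSR) != sizeof(value)) {
        perror("pread");
        close(fd);
        return 1;
    }
    printf("MSR 0x150 = 0x%016llx\n", (unsigned long long)value);
    close(fd);
    return 0;
}
```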
In recent years, these interfaces have received significant academic, hacker, and industry attention because careful momentary adjustments of the voltage or frequency (or both) at runtime can be misused to induce glitches in computations performed by the processor, not only on x86, but also on other architectures. In essence, the idea is to induce errors in computations by bringing the CPU just below the threshold of stability, with voltage or frequency settings slightly outside the supported performance envelope.
When timed correctly, it is possible to corrupt sensitive computations without crashing the system outright. For instance, a glitch might alter a memory address so that data leaks out to where it shouldn’t be, or it could alter the value of a variable that the attacker should have no control over (silent data corruption). This has been shown in such work as V0ltpwn, CLKSCREW, and Plundervolt, which used voltage glitching to break the security of cryptographic operations running in an SGX enclave.
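A common building block in this line of work is a glitch-detection loop: repeatedly execute a fixed computation whose correct result is known in advance, and flag any iteration that deviates. Below is a simplified, illustrative sketch; the operands are arbitrary, and real tests target specific instructions and data paths.

```c
// Simplified sketch of a glitch-detection loop: repeat a deterministic
// multiplication and report any iteration whose result differs from the
// known-correct value. The operands are arbitrary placeholders.
#include <stdint.h>
#include <stdio.h>

int main(void) {
    const uint64_t a = 0xDEADBEEF;
    const uint64_t b = 0x42;
    const uint64_t expected = a * b;   // computed once, up front

    for (uint64_t i = 0; i < 100000000ULL; i++) {
        volatile uint64_t x = a;       // volatile discourages constant folding
        volatile uint64_t y = b;
        uint64_t result = x * y;
        if (result != expected) {
            printf("glitch at iteration %llu: 0x%llx != 0x%llx\n",
                   (unsigned long long)i,
                   (unsigned long long)result,
                   (unsigned long long)expected);
        }
    }
    return 0;
}
```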
Glitching can be achieved in many ways, depending on the target system. Aside from voltage and frequency adjustments, old gaming consoles such as the Nintendo 64 or Sega Genesis were susceptible to glitches caused by inserting the cartridge slightly askew, to name a very hands-on example.
Glitching attacks can be either hardware-based or software-based. In hardware-based attacks, the attacker physically tampers with the circuit, for example with the voltage regulators. Hardware glitching can be used, for instance, to bypass boot-time firmware checks so that custom firmware can be run on an embedded system. But for our purposes, software-based glitching attacks (where attackers use only software interfaces) are especially interesting because they open the door for remote attackers without physical access to the system, greatly broadening the scope of the threat. Moreover, they require no specialized equipment. This is why there has been such widespread attention to undervolting-based attacks like Plundervolt that use software interfaces such as the OC mailbox.
In light of these threats, Intel has tightened security on interfaces like the OC mailbox and implemented features including Undervolt Protection (UVP). In some of our previous work, we have explored and mitigated alternative variants of undervolting attacks, to test the limits of these defenses.
Ultimately, however, undervolting-based glitching is (hopefully) becoming more difficult. This raises the question: what alternative avenues for glitching remain? Software-based thermal glitching could be one such alternative approach. But before we discuss the details of SBTG, let’s consider why we think thermal effects could yield results similar to undervolting.
How SBTG Could Produce Undervolting-like Effects
The electrical and thermal properties of materials are innately linked. For example, in metals at or above room temperature, a higher temperature generally increases the resistance of the material, so that signals have a harder time “getting through.” This is illustrated in the following graph, which shows a typical relation between Vmin (the minimum operating voltage the CPU needs to prevent it from crashing) and temperature (in degrees Celsius), as measured experimentally on a particular platform at a fixed frequency of 3.5 GHz.
As you can see, at higher temperatures the system needs a higher voltage to remain stable. If the voltage is insufficient, then signals traveling through the metal pathways on a CPU die could become corrupted, or delayed enough to exceed timing tolerances, thus resulting in a glitch. This can happen if, for instance, the high temperature is localized (a hotspot), and is not near one of the (relatively sparse) thermal sensors on the die.
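For the metal pathways themselves, the temperature dependence of resistance is commonly approximated by the standard linear textbook model (a general approximation, not a fit to the measurements above):

$$R(T) \approx R_0 \left[ 1 + \alpha \, (T - T_0) \right]$$

where $R_0$ is the resistance at a reference temperature $T_0$ and $\alpha$ is the material’s temperature coefficient of resistance, roughly 0.004 per degree Celsius for copper.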
In semiconductors, the situation is more complex. At low supply voltages and on small process nodes (sub-90 nm), gate delay can decrease (instead of increase) with rising temperature, meaning that higher frequencies are attainable at hotter temperatures, a phenomenon known as Inverse Temperature Dependence (ITD). Gate delay can also follow a non-linear profile depending on the details of the semiconductor, so that the relationship between temperature and gate delay is “conventional” on some parts of the curve and “inverted” on others.
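A standard way to see why ITD can arise is the alpha-power law approximation of gate delay:

$$t_d \;\propto\; \frac{C_L \, V_{dd}}{\mu(T)\,\bigl(V_{dd} - V_{th}(T)\bigr)^{\alpha}}$$

where the load capacitance $C_L$ is driven at supply voltage $V_{dd}$, and both the carrier mobility $\mu(T)$ and the threshold voltage $V_{th}(T)$ decrease as temperature rises. At high supply voltages the mobility term dominates and delay grows with temperature; at low supply voltages, $V_{dd} - V_{th}$ is small, the shrinking threshold voltage wins, and delay can instead decrease with temperature.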
Non-homogeneous temperature distributions (one part of the die being hotter than another) complicate matters even further. We refer to such temperature distributions as thermal gradients. Due to the varying temperature, and thus varying resistances and gate delays, that a signal encounters when traveling along a gradient, the signal may not only become delayed or attenuated, but its shape may also become distorted due to a variety of effects, such as RC filtering.
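As a first-order intuition, the delay a signal accumulates along a wire can be approximated by an Elmore-style sum of RC products, where each segment contributes with its own, temperature-dependent resistance:

$$\tau \;\approx\; \sum_i R_i(T_i)\, C_i$$

with $R_i(T_i)$ the resistance of segment $i$ at its local temperature and $C_i$ the capacitance it drives. This simplified view ignores the semiconductor effects discussed above, but it illustrates how a gradient can skew and reshape a signal edge rather than delaying it uniformly.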
All this to say that investigating the glitching potential of thermals is not as simple as “run the system as hot as possible and see what happens.” If the CPU fails to properly scale the voltage and frequency for the physical conditions it is subjected to, then issues may arise variously due to hot spots, cold spots, or thermal gradients that involve a combination of both. Even the way thermal conditions vary over time can play a role, leading to a large and complicated state space to explore.
The question is not whether these effects exist. We know from physics that they do. The question is whether we can make these effects pronounced enough to cause targeted and repeatable glitches, and whether we can do so using only software-based methods such as running (user-space) workloads that heat the die in a specific (not necessarily homogeneous) way.
Creating and Monitoring Thermal Patterns on the Die
As we mentioned earlier, the digital thermal sensors (DTS) that monitor temperature on the die are relatively sparse; each one takes up a fair amount of floor space, so we can only place so many of them. Chip designers run extensive tests of the CPU’s thermal behavior to determine the optimal placement of the thermal sensors. This includes simulations of speedpaths (timing-critical paths on the die) using static timing analysis software like Synopsys PrimeTime, as well as post-silicon testing.
Simulation-based testing can catch many potential problems early, but it is computationally intensive, necessitating the use of simplified models of chip behavior to decrease computational cost. Moreover, it is infeasible to exhaustively simulate the full range of voltage, frequency, and thermal conditions. This means that no simulation can fully rule out thermal issues. Therefore, there is a case to be made for post-silicon thermal testing.
In our research, we explore more complex thermal scenarios (such as thermal gradients and transients) than are covered in standard post-silicon testing. This requires a more fine-grained picture of the thermal state on-die at a given moment than the standard DTS sensors yield. We therefore use a Thermal Modeling Tool (TMT), developed at Intel.
In a nutshell, TMT increases the spatial resolution with which we can monitor thermals on-die by augmenting the data from the standard thermal sensors with data from intra-die variation probes (IDV probes), which are simpler sensors that are scattered around the die. These probes are used during manufacturing to measure variations in the fabrication process. While they cannot directly measure temperature, they can measure other properties such as the local voltage. Since the voltage measured by the probe depends on temperature (among other things) it can be used as a proxy measurement.
TMT uses a machine learning approach to learn the relationship between IDV readings and temperature. To enable TMT for a particular CPU sample, we go through an initial calibration phase where we use specialized hardware that leverages the Peltier effect to heat the CPU to a range of known temperatures. Since the temperatures are known, TMT can learn the correlation with the IDV readings.
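Conceptually, and greatly simplified, the calibration boils down to fitting a mapping from probe readings to known temperatures. The sketch below shows a per-probe linear least-squares fit on made-up sample values; TMT’s actual model is considerably more sophisticated and is not described here.

```c
// Greatly simplified illustration of the calibration idea: fit a linear model
// temperature = a * reading + b for a single probe, using (reading, temperature)
// pairs gathered while the die is held at known temperatures. The sample values
// below are made up, and the real TMT model is more sophisticated.
#include <stdio.h>

// Ordinary least-squares fit of temp = a * reading + b.
static void fit_linear(const double *reading, const double *temp, int n,
                       double *a, double *b) {
    double sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        sx  += reading[i];
        sy  += temp[i];
        sxx += reading[i] * reading[i];
        sxy += reading[i] * temp[i];
    }
    *a = (n * sxy - sx * sy) / (n * sxx - sx * sx);
    *b = (sy - *a * sx) / n;
}

int main(void) {
    double reading[] = {0.71, 0.73, 0.76, 0.80};  // hypothetical probe readings
    double temp[]    = {40.0, 55.0, 70.0, 90.0};  // known die temperatures (C)
    double a, b;
    fit_linear(reading, temp, 4, &a, &b);
    printf("estimated temperature = %.2f * reading + %.2f\n", a, b);
    return 0;
}
```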
After the calibration phase is done, we return to stock cooling to run realistic tests for our thermals research. As you can see in the following figure, TMT allows us to get a fairly detailed idea of what the thermal situation is on the die when we’re running a particular test workload.
Another ingredient that we need to conduct thermals research is the ability to not only measure thermal patterns, but also induce them. Ideally, we want to be able to run a wide range of tests without having to painstakingly handcraft workloads for each pattern we want to try.
To this end, we developed a program that we think of as a thermal waveform generator, analogous to the arbitrary waveform generators used in electronics. It can generate thermal patterns over space and time based on high-level mathematical or procedural descriptions of the desired pattern. For instance, we can write rules that say “create a thermal sine wave in the floating point unit, with a period of one second and the temperature ranging between 40 and 90 degrees Celsius.”
The waveform generator has a library of “building block” workloads that induce known thermal patterns, and it automatically tries to piece together and modulate these building blocks to create an approximation of the requested thermal pattern. This way, we can easily test many thermal scenarios without having to write specialized test programs for each case. The following figure shows a simple example of a slow thermal sine wave created by the thermal waveform generator on the basis of a mathematical description of the pattern function. The resulting pattern was measured and plotted with TMT.
Of course, the waveform generator works on a best-effort basis; it cannot create physically impossible thermal patterns, but it does its best to generate a pattern as close as possible to what was requested. The main loop of the thermal waveform generator (where the waveform actuation happens) is highly optimized to minimize the impact of the pattern generation logic itself on the CPU’s workload. Using this approach, we are currently able to achieve a temporal resolution on the order of a millisecond.
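To give a flavor of what such an actuation loop might look like, here is a conceptual sketch, not our actual implementation: each millisecond slot is split between a heat-generating building block and idle time, with the split adjusted to track the requested pattern. The helper functions are hypothetical stand-ins for real building-block workloads and for TMT/DTS temperature feedback.

```c
// Conceptual sketch of a duty-cycle actuation loop for a thermal waveform
// generator (not the actual implementation): each 1 ms slot is split between
// a heat-generating workload and idling, with the split adjusted to track a
// target temperature function such as a slow sine wave.
#include <math.h>
#include <time.h>

// Placeholder "heater": spin on floating point work for roughly ns nanoseconds.
// A real building block would be tuned to heat a specific unit on the die.
static void run_heater_for_ns(long ns) {
    struct timespec start, now;
    volatile double x = 1.0001;
    clock_gettime(CLOCK_MONOTONIC, &start);
    do {
        x *= 1.0001;  // arbitrary heat-generating work
        clock_gettime(CLOCK_MONOTONIC, &now);
    } while ((now.tv_sec - start.tv_sec) * 1000000000L +
             (now.tv_nsec - start.tv_nsec) < ns);
}

// Placeholder feedback; a real implementation would query TMT or DTS readings.
static double read_local_temp(void) { return 65.0; }

// Example target: sine wave between 40 and 90 degrees Celsius, 1 s period.
static double target_temp(double t_sec) {
    return 65.0 + 25.0 * sin(2.0 * 3.141592653589793 * t_sec);
}

static void actuate(double duration_sec) {
    const long slot_ns = 1000000;               // 1 ms actuation slots
    long slots = (long)(duration_sec * 1000.0);
    double duty = 0.5;                          // fraction of each slot spent heating
    for (long i = 0; i < slots; i++) {
        double error = target_temp(i / 1000.0) - read_local_temp();
        duty += 0.01 * error;                   // simple proportional adjustment
        if (duty < 0.0) duty = 0.0;
        if (duty > 1.0) duty = 1.0;
        long heat_ns = (long)(duty * slot_ns);
        run_heater_for_ns(heat_ns);
        struct timespec idle = {0, slot_ns - heat_ns};
        nanosleep(&idle, NULL);                 // idle for the rest of the slot
    }
}

int main(void) {
    actuate(2.0);  // generate the pattern for two seconds
    return 0;
}
```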
Experimental Results and Future Work
Using this methodology, we have investigated, or are in the process of investigating, various classes of thermal glitching attacks.
We categorize SBTG attack classes using multiple dimensions. First of all, we distinguish short-lived thermal spikes (thermal transients) from stable thermal gradients. Moreover, gradients can occur on various scales, ranging from within-core gradients (gradients within a single CPU core), to between-cores gradients (gradients between multiple cores on the same die), or even SoC-wide gradients (gradients that stretch beyond the computation cores and ring interconnect onto other parts of the SoC). Finally, we differentiate between traditional hot spots, and cold spots that could cause issues due to the Inverse Temperature Dependence effect.
To test a particular SBTG scenario, we use our thermal waveform generator to induce a thermal pattern, and simultaneously run an appropriate glitch test that checks for glitches during the experiment. A typical glitch test will continuously execute computations that cause a data stream across the region of the die affected by the induced thermal pattern while checking for corruptions to this data. For instance, for a within-core gradient in a Redwood Cove core that spans from the floating point unit to the mid-level cache (a relatively long path in the Redwood Cove P-cores used in Meteor Lake, see the figure below), a suitable glitch test could involve floating point computations that fetch operands from and store results in the MLC.
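As an illustration, a simplified glitch test for this kind of path could look like the sketch below. The buffer size is a placeholder chosen to roughly match an MLC-sized working set; a real test would also pin the thread to the target core and take care to defeat compiler optimizations.

```c
// Simplified sketch of a glitch test that streams floating point data through
// a cache-sized buffer: fill it with known values, repeatedly run FP operations
// over it, and flag any deviation from the expected result. The working set
// size is a placeholder assumption.
#include <stdio.h>
#include <stdlib.h>

#define WORKING_SET_BYTES (2 * 1024 * 1024)   // assumed MLC-sized working set
#define N (WORKING_SET_BYTES / sizeof(double))

int main(void) {
    double *buf = malloc(N * sizeof(double));
    if (!buf) return 1;

    for (size_t i = 0; i < N; i++)
        buf[i] = 1.0;                         // known initial values

    for (long iter = 0; iter < 100000; iter++) {
        double sum = 0.0;
        for (size_t i = 0; i < N; i++) {
            buf[i] = buf[i] * 1.0 + 0.0;      // FP ops that load and store via the cache
            sum += buf[i];
        }
        if (sum != (double)N)                 // any deviation indicates a corruption
            printf("glitch at iteration %ld: sum = %.17g\n", iter, sum);
    }
    free(buf);
    return 0;
}
```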
We’ll highlight two cases we’ve tested so far, starting with a case where we induce very rapid thermal spikes (on the order of a millisecond) within a single core, alternately generating as much heat as we can in a quick burst, then allowing the core to cool for an instant before repeating the pattern. The idea is to time these temperature spikes such that they occur in between polls of the DTS sensors, so that the spikes go unnoticed by the control algorithm, and hence the appropriate actuation does not happen.
Our findings suggest that we need not worry about this particular scenario in practice. Although we can spike the temperature rapidly, heat dissipation on the die seems to take longer on average than heat generation. Not all of the heat dissipates before the next spike. This means that the resulting thermal signal has a “long tail” on each individual spike, and over time there is an average temperature increase on the die because of the accumulation of residual heat. This temperature increase is noticeable to the control algorithm, which actuates appropriately. Hence, we are unable to achieve glitches this way.
Thus, there appears to be a physical tradeoff that prevents this class of attack. To cause a significant effect on the electrical signals we are trying to glitch, the thermal spikes need to achieve high temperatures (we want to maximize amplitude). But the bigger the spikes, the longer the heat dissipation takes, and the faster the die accumulates residual heat over time. Conversely, if we avoid heat accumulation by reducing the amplitude of the thermal spikes, then the effect on the electrical signals is reduced to the point where glitches cannot occur.
Another scenario that we are currently investigating combines stable gradients with fast thermal spikes of the sort mentioned above. In this scenario, it’s not so much the thermal profile of the spikes that we care about, but rather the sudden power draw they cause. The idea is to use a stable thermal gradient to induce a baseline effect on signals traveling across it, while simultaneously using the repeated voltage droops caused by the rapid power fluctuations to tip those signals “over the edge” into glitch territory before the control algorithm has a chance to respond.
We are in the process of testing this idea on multiple paths, including the floating point unit to MLC path mentioned above, on multiple CPU generations and using various workloads and power spike frequencies. To maximize the potential of this line of research, we are working with DVFS and power delivery experts at Intel to leverage specialized power viruses as building block workloads for the thermal waveform generator. These power viruses use low-level knowledge of chip design details to produce pathological worst-case scenarios, allowing us to achieve optimal spikes in power draw and increase the steepness of thermal gradients.
Whether or not this line of research ultimately produces a glitch, we’ll certainly learn a lot about how far we can push the limits of modern x86 processors, and we’ll achieve a better understanding of how to safely and securely handle thermal and electrical interactions on-die.
Conclusion
Thermals offer a vast and varied space of potential glitches to explore, but gaining insight into this space is difficult. The physics of thermal effects on electrical signals traversing a maze of metal pathways and semiconductor transistors switching at gigahertz speeds is complex. Moreover, the CPU’s built-in thermal sensors give little insight into the details of these effects as they occur. Our methodology incorporates a thermal modeling tool that provides more detailed insight, with greater temporal and spatial resolution than the standard DTS sensors, and a software thermal waveform generator that lets us rapidly prototype a wide range of SBTG patterns described at a high level. We hope this article has given you an idea of how we are putting this methodology to work to begin illuminating this broad and interesting area.
Footnote
1 On systems with fully integrated voltage regulators (FIVR), more of the voltage regulation circuitry is integrated on the CPU die rather than on the motherboard.
Share Your Feedback
We want to hear from you. Send comments, questions, and feedback to the INT31 team.
About the Authors
Daniel Andriesse is a security researcher in Intel's INT31 team, who focuses on glitching-related issues. Daniel is a passionate and seasoned researcher, with 15+ years of experience hacking and defending an assortment of systems. Before joining Intel, he was a member of the VUSec research group at Vrije Universiteit Amsterdam, where he completed his PhD on the topic of binary analysis. He is also the author of the book Practical Binary Analysis, and was one of the main reverse engineers involved in Operation Tovar, the FBI-sponsored takedown of the GameOver Zeus peer-to-peer botnet.
Neer Roggel is a security researcher in Intel's INT31 team, leading client platform security research. Neer is a passionate and seasoned researcher, with 20+ years of experience hacking and defending an assortment of systems. At Intel, he has led red teams to mitigate architectural and implementation vulnerabilities in a variety of security technologies, including Intel® Trusted eXecution Technology and confidential computing technologies. In his current role, he tackles reported breaches and simulates threat actors to discover and exploit deep vulnerabilities, ultimately driving preemptive, principled, and pragmatic defenses into Intel platforms. His current focus is on power management security, a new and evolving topic. Neer’s background is in securing critical infrastructure (energy, water, railways, and telephony), product evaluation and penetration testing (OS internals and networking, reverse engineering, applied cryptography), malware analysis and incident response. He holds an MSc in Computer Science from the Technion, Israel Institute of Technology, with a focus on privacy enhancing technologies.