# Cache-Based Attacks on Cryptographic Implementations

**Source**: https://quantumsequrity.com/blog/cache-based-attacks-crypto
**Category**: Threats & Attacks

---

[← Back to Blog](../../blog.html) Threats & Attacks

# Cache-Based Attacks on Cryptographic Implementations

11 min read

When you encrypt a file, the math behind the cipher is only half the story. The other half is what happens inside your computer's CPU while that math runs. Modern processors have small, fast memories called caches that sit between the CPU and main memory. Caches make programs run faster by keeping recently used data close at hand. They also leak secrets.

A cache-based attack watches the patterns of cache hits and misses while a cipher runs. From those patterns, an attacker can reconstruct secret keys without ever reading the key directly. The attacker does not need physical access to the chip. A program running on the same machine, or even a different virtual machine on the same host, can pull this off. Cloud servers, shared hosting, and consumer laptops are all potential targets.

This post walks through the three main families of cache attacks: Flush+Reload, Prime+Probe, and Evict+Reload. We will look at how each one works against AES, why T-table implementations of AES were so vulnerable, and how constant-time code defends against the whole class of attacks.

## How CPU Caches Leak Information

A CPU cache holds copies of memory the processor has used recently. When the program asks for data, the CPU first checks the cache. A hit returns the data in a few nanoseconds. A miss forces the CPU to fetch from main memory, which takes hundreds of nanoseconds. The timing gap between hit and miss is huge, and it is measurable from software using instructions like rdtsc on x86.

Caches are organized into lines, typically 64 bytes each, and grouped into sets. When the processor loads an address, the cache figures out which set it belongs to and either places it in a free line or evicts an older line to make room. This eviction policy is what makes side-channel attacks possible. If two processes share the same cache, they can interfere with each other's cache lines, and that interference is observable.

In a multi-tenant environment, the last-level cache (LLC) is shared across all CPU cores. That means a process on core 0 and a process on core 1 share the same LLC. If both processes touch the same memory address, they share a cache line. If they touch addresses that map to the same set, they fight for the same lines. Both situations create a side channel.

## Flush+Reload: The Surgical Strike

Flush+Reload was published by Yuval Yarom and Katrina Falkner at USENIX Security 2014. It is the most precise of the cache attacks, and it works whenever the attacker shares memory with the victim. Shared memory happens more often than you might think. Operating systems use copy-on-write to share read-only pages of executables and shared libraries between processes. Two processes running OpenSSL share the OpenSSL code pages and any constant lookup tables.

The attack runs in three steps. First, the attacker uses the clflush instruction to flush a specific cache line, typically pointing at a known offset inside the AES T-tables loaded by the victim. Second, the attacker waits a tiny amount of time. Third, the attacker reads the flushed address and times the access. If the read is fast, it means the victim accessed that address during the wait window and reloaded the line into the cache. If the read is slow, the victim did not touch that line.

By repeating this across every offset of the T-table, the attacker learns exactly which table entries the victim accessed during a single AES encryption. Those entries depend on bytes of the secret key combined with bytes of the plaintext. With a few thousand encryptions, the attacker can recover the full 128-bit AES key.

Flush+Reload works across virtual machines on the same hypervisor when memory deduplication is enabled. Researchers Ahmad Moghimi, Gorka Irazoqui, and Thomas Eisenbarth showed in their 2018 CHES paper "MemJam" that memory deduplication features in VMware and KVM can let one virtual machine attack another sharing the same hypervisor.

## Prime+Probe: When You Cannot Share Memory

Prime+Probe does not need shared memory. It exploits cache contention instead. Daniel Osvik, Adi Shamir, and Eran Tromer described it in detail in their 2006 paper "Cache Attacks and Countermeasures: The Case of AES" presented at CT-RSA 2006.

The attacker fills, or primes, an entire cache set with their own data by reading enough addresses that all map to the same set. Then the attacker waits for the victim to run. If the victim accesses any address that maps to the primed set, it evicts one of the attacker's lines. When the attacker probes the set by re-reading their own data, the evicted line takes longer to load, and that timing difference reveals the victim's activity.

Prime+Probe is noisier than Flush+Reload. The attacker learns only that the victim touched some address in a particular set, not which specific address. Modern CPUs can hold thousands of distinct addresses per set. Recovery takes more samples and more statistical work, but the attack is fundamentally still feasible. Mehmet Sinan Inci and his coauthors demonstrated full AES key recovery from Amazon EC2 co-tenant virtual machines using Prime+Probe in their 2016 IACR ePrint paper "Cache Attacks Enable Bulk Key Recovery on the Cloud."

## Evict+Reload: A Hybrid

Evict+Reload, introduced by Daniel Gruss and colleagues in their 2015 paper "Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches" at USENIX Security 2015, combines elements of both. Like Prime+Probe, it works without the clflush instruction. Like Flush+Reload, it relies on shared memory between attacker and victim.

The attacker first reads a victim address into the cache to load it. Then the attacker evicts that address by accessing enough other addresses that map to the same set. After waiting, the attacker reloads the original address and times the access. A fast read means the victim accessed it during the wait. A slow read means they did not.

Evict+Reload is useful on hardware that disables clflush in user space, like ARM platforms in some configurations. It has been used to attack AES, RSA, and TLS implementations on Android devices.

## Why AES T-Table Implementations Were a Disaster

Classic AES implementations used four 1KB lookup tables, called T-tables, to fuse the SubBytes and MixColumns operations into a single table lookup per round. Tables made AES fast on general-purpose CPUs, but they made it leak. Each table access reads memory at an index that depends on the round key XOR'd with the state. Cache line resolution, typically 64 bytes, means the upper 4 bits of each lookup index are visible to a cache attacker. Across many encryptions, the attacker can statistically reconstruct the upper bits of every key byte and then brute-force the rest.

Daniel Bernstein's 2005 paper "Cache-Timing Attacks on AES" demonstrated remote key recovery from a network server using only timing measurements over the network, no shared memory required. The paper triggered a wave of research and convinced the cryptographic community that table-based AES had to go.

## The Move to Constant-Time AES

The defense is constant-time code. A constant-time implementation has no secret-dependent memory access patterns and no secret-dependent branches. For AES, this means either using the AES-NI hardware instruction set on Intel and AMD CPUs, or using bitsliced software AES that performs all S-box operations using bitwise logic instead of table lookups.

AES-NI was introduced by Intel in 2010. It executes a full AES round in a single instruction, in constant time, with no microarchitectural side channels. ARM's equivalent is the AES extension introduced in ARMv8. Modern OpenSSL, BoringSSL, and libsodium all detect AES-NI at runtime and use it whenever available. When AES-NI is unavailable, libsodium and BearSSL fall back to bitsliced implementations.

Bitsliced AES treats each bit of the state as a separate lane in a 128-bit register and computes all 128 lanes in parallel using boolean logic. There are no tables, no branches, and no secret-dependent timing. The cost is that bitsliced AES is slower than table-based AES, but with modern SIMD it remains acceptable for most workloads.

For deeper context on AES modes and how authentication interacts with these implementations, see [AES-256-GCM Explained](aes-256-gcm-explained.md).

## What This Means for Post-Quantum Cryptography

Post-quantum schemes are not immune to cache attacks. ML-KEM, the NIST-standardized lattice KEM, performs polynomial arithmetic with secret-dependent operations. A naive implementation of the rejection sampling step in ML-KEM key generation could leak through a cache channel. The reference implementations released as part of the NIST PQC standardization process use constant-time arithmetic and avoid secret-dependent table lookups. The FIPS 203 standard explicitly requires constant-time implementations for production use.

ML-DSA, the lattice signature scheme, also requires constant-time rejection sampling and constant-time NTT operations. The 2022 paper "KyberSlash" by Goutam Tamvada and Daniel Genkin showed that early ML-KEM implementations had timing variations in the polynomial division step. The reference implementation was patched in 2023 to use Barrett reduction with constant-time conditional subtraction.

For QNSQY users, this matters because the cryptographic core ships with the NIST reference implementations of ML-KEM and ML-DSA, both of which have been audited for constant-time behavior. Hybrid mode pairs them with X25519 and Ed25519, both of which use constant-time scalar multiplication on Curve25519. To understand why hybrid mode is the right choice during this transition, see [Hybrid Encryption](hybrid-encryption.md).

## Real-World Defenses Beyond Constant-Time Code

Constant-time code is necessary but not sufficient. The CPU itself can introduce timing variations through speculative execution, branch prediction, and hardware prefetching. Mitigations against these effects fall into several categories.

Address space layout randomization can complicate attacks that depend on knowing the absolute virtual address of a target buffer, but it does not stop attacks based on relative offsets within a single object. Process isolation through seccomp or sandboxing limits which processes can run on the same CPU as your cryptographic code. Cloud platforms now offer dedicated instance types where you do not share physical hardware with other tenants.

For the highest-value workloads, hardware security modules and trusted execution environments like Intel SGX provide isolated execution, but research has shown that even SGX is vulnerable to certain cache and speculation attacks. The 2018 Foreshadow attack demonstrated extraction of SGX attestation keys. The lesson is that hardware isolation is a defense in depth, not a complete defense.

## FAQ

**Can a webpage in my browser run a cache attack on my keys?**
Pure JavaScript was once able to mount limited cache attacks until browsers reduced timer resolution and added cross-origin isolation requirements for SharedArrayBuffer. Today, attacks from JavaScript are difficult but not impossible. WebAssembly running in a poorly isolated browser tab is a more realistic threat model than plain JavaScript.

**Does AES-NI completely solve the problem?**
For AES specifically, AES-NI eliminates table-based leakage and runs in constant time. It does not protect other cryptographic primitives like ChaCha20 or post-quantum schemes, which need their own constant-time implementations.

**Is my cloud server safe?**
Major cloud providers run mitigations, but the threat surface is large. Choosing dedicated instance types, avoiding burstable shared instances for sensitive workloads, and keeping cryptographic operations inside HSMs reduces but does not eliminate exposure.

**How do I tell if my crypto library is constant-time?**
Tools like dudect and ctgrind perform statistical timing analysis on cryptographic functions. The Wycheproof project from Google publishes test vectors that include timing-related edge cases. Reputable libraries publish their constant-time analysis in security audits.

**Are post-quantum algorithms more or less vulnerable than classical ones?**
The risk is comparable. Both classical and post-quantum algorithms have specific operations that need constant-time implementation. The good news is that lessons learned from decades of classical crypto side-channel research have informed the design of post-quantum reference implementations.

## Sources

1. Yarom, Y. and Falkner, K. "FLUSH+RELOAD: A High Resolution, Low Noise, L3 Cache Side-Channel Attack." USENIX Security 2014. https://www.usenix.org/conference/usenixsecurity14/technical-sessions/presentation/yarom
2. Osvik, D., Shamir, A., and Tromer, E. "Cache Attacks and Countermeasures: The Case of AES." CT-RSA 2006. https://eprint.iacr.org/2005/271
3. Gruss, D., Spreitzer, R., and Mangard, S. "Cache Template Attacks: Automating Attacks on Inclusive Last-Level Caches." USENIX Security 2015. https://www.usenix.org/conference/usenixsecurity15/technical-sessions/presentation/gruss
4. Bernstein, D. J. "Cache-Timing Attacks on AES." 2005. https://cr.yp.to/antiforgery/cachetiming-20050414.pdf
5. NIST FIPS 203. "Module-Lattice-Based Key-Encapsulation Mechanism Standard." 2024. https://csrc.nist.gov/pubs/fips/203/final
6. Inci, M. S., Gulmezoglu, B., Irazoqui, G., Eisenbarth, T., and Sunar, B. "Cache Attacks Enable Bulk Key Recovery on the Cloud." IACR ePrint 2016/596. https://eprint.iacr.org/2016/596

## Related Articles

- [What is Post-Quantum Cryptography?](what-is-post-quantum-cryptography.md)
- [AES-256-GCM Explained](aes-256-gcm-explained.md)
- [Hybrid Encryption](hybrid-encryption.md)
- [ML-KEM Explained](ml-kem-explained.md)
- [Why RSA-2048 Will Break](why-rsa-2048-will-break.md)

---

### Protect Your Data Before Q-Day Arrives

QNSQY's NIST-standardized post-quantum encryption protects files against both current and quantum-era threats.

[Try QNSQY](../../pricing.html)