I'm seeing a dramatic instructions-per-cycle collapse (2.08 -> 1.30) when increasing the loop body size in straight-line arithmetic code, but the L1 instruction cache miss rate stays essentially constant at ~9.3%. What CPU frontend bottleneck could cause this?
Summary
Lines | IPC | L1i miss rate |
---|---|---|
30K | 2.08 | 9.28% |
70K | 1.36 | 9.27% |
100K | 1.30 | 9.27% |
Since the L1i miss rate is identical across sizes, this doesn't look like a simple cache problem. What frontend limitation (decode bandwidth, the micro-op cache, fetch bandwidth) could explain it?
Setup
CPU: Intel Xeon Gold 6132 @ 2.60GHz (Skylake-SP, 14 cores/socket, 2 sockets)
- L1i cache: 32 KiB per core
- L2 cache: 1 MiB per core
- L3 cache: 19.25 MiB shared per socket
Compiler: gcc -O1
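Back-of-the-envelope code footprint (my own estimate, assuming -O1 emits roughly four instructions of ~7 bytes per source line, since volatile forces a load/load/add/store pattern; see the throughput note at the end):

insns_per_line = 4   # assumption: mov load, mov load, add, mov store
bytes_per_insn = 7   # assumption: RIP-relative memory operands run ~6-8 bytes

for lines in (30_000, 70_000, 100_000):
    mib = lines * insns_per_line * bytes_per_insn / (1024 * 1024)
    print(f"{lines:>7} lines ~= {mib:.2f} MiB of code")

# ~0.80, ~1.87, ~2.67 MiB: every size blows out the 32 KiB L1i (consistent
# with the flat ~9.3% miss rate), but only the 30K case still fits in the
# 1 MiB L2, so the larger loops would have to stream code from L3.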
Test Code
Autogenerated C programs with different numbers of arithmetic lines in a while(1) loop:
#include <stdio.h>
#include <stdint.h>

// Prevent optimization by making variables volatile
volatile uint64_t a = 1, b = 2, c = 3, d = 4;
volatile uint64_t e = 5, f = 6, g = 7, h = 8;
volatile uint64_t counter = 0;

int main() {
    printf("Starting 50000-line arithmetic test\n");
    while (1) {
        b = e + h;
        c = d + e;
        f = g + a;
        b = d + h;
        f = d + d;
        d = a + d;
        e = a + h;
        // ... thousands more lines of similar arithmetic
        c = e + b;
        h = a + b;
        e = c + e;
        f = b + d;
        if ((a + b + c + d + e + f + g + h) == 0) {
            printf("Done\n");
        }
    }
    return 0;
}
Each test contains almost nothing but arithmetic on volatile variables (the volatile qualifier keeps the compiler from optimizing the lines away); the only branches are the loop back-edge and the never-taken if at the bottom.
Results
Running perf stat with a 10-second timeout, so every run executes for the same ~36.7B cycles and only the number of retired instructions differs:
Lines | Cycles | IPC | L1i misses | L1i miss rate |
---|---|---|---|---|
30K | 36.7B | 2.08 | 7.08B | 9.28% |
40K | 36.8B | 1.94 | 6.61B | 9.27% |
50K | 36.6B | 1.59 | 5.38B | 9.27% |
60K | 36.8B | 1.44 | 4.91B | 9.27% |
70K | 36.8B | 1.36 | 4.65B | 9.27% |
100K | 36.7B | 1.30 | 4.43B | 9.27% |
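A sanity check on the micro-op cache (my own estimate; Skylake's DSB holds 1.5K uops, 32 sets x 8 ways x 6 uops per line):

loop_uops = 30_000 * 4       # smallest test, assuming ~4 uops per source line
dsb_capacity = 32 * 8 * 6    # = 1536 uops on Skylake
print(loop_uops / dsb_capacity)   # ~78x over capacity

# Even the smallest loop exceeds the DSB by nearly two orders of magnitude,
# so every variant should already be running from the legacy decode path
# (MITE); the uop cache can't be what separates 30K from 100K lines.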
Question
What could be causing this IPC collapse as the loop body grows, given that the L1i miss rate is unchanged? Is this hitting some frontend bottleneck in the CPU pipeline? Which performance counters would help pin down the actual bottleneck?
Full perf command:
perf stat -e cycles,instructions,branch-instructions,branch-misses,L1-dcache-load-misses,L1-icache-load-misses,LLC-load-misses,dTLB-load-misses,iTLB-load-misses -- timeout 10s ./test_70k
36,834,726,435 cycles
50,018,331,051 instructions # 1.36 insn per cycle
9,449,823 branch-instructions
321,923 branch-misses # 3.41% of all branches
327,729 L1-dcache-load-misses
4,638,663,649 L1-icache-load-misses
20,629 LLC-load-misses
15,639 dTLB-load-misses
154,250 iTLB-load-misses
perf stat -e cycles,idq_uops_not_delivered.cycles_fe_was_ok,resource_stalls.any,resource_stalls.sb -- timeout 10s ./test_70k
36,759,013,093 cycles
1,489,045,825 idq_uops_not_delivered.cycles_fe_was_ok
14,521,941 resource_stalls.any
3,891,491 resource_stalls.sb
10.001279728 seconds time elapsed
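Candidate events for the next run (I believe these exist on Skylake-SP), to split uop delivery between the DSB and the legacy decoders and to check whether code fetches are missing L2:

perf stat -e cycles,idq.dsb_uops,idq.mite_uops,idq_uops_not_delivered.core,l2_rqsts.code_rd_hit,l2_rqsts.code_rd_miss -- timeout 10s ./test_70k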
Here is how I generated the code:
import random

num_lines = 70000  # varied per test: 30K ... 100K
operations = ['+']
variables = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
lines = []
for i in range(num_lines):
    var1 = random.choice(variables)
    var2 = random.choice(variables)
    result = random.choice(variables)
    op = random.choice(operations)
    lines.append(f"{result} = {var1} {op} {var2};")
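For completeness, a minimal sketch of how the generated lines get wrapped into a compilable program (my reconstruction of the template, not the script's exact output):

body = "\n        ".join(lines)
program = f"""#include <stdio.h>
#include <stdint.h>

volatile uint64_t a = 1, b = 2, c = 3, d = 4;
volatile uint64_t e = 5, f = 6, g = 7, h = 8;

int main() {{
    printf("Starting {num_lines}-line arithmetic test\\n");
    while (1) {{
        {body}
        if ((a + b + c + d + e + f + g + h) == 0) {{
            printf("Done\\n");
        }}
    }}
    return 0;
}}
"""
with open(f"test_{num_lines // 1000}k.c", "w") as out:
    out.write(program)
# build: gcc -O1 -o test_70k test_70k.c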
A note on expected throughput, plus a follow-up experiment: with the variables as plain locals in stack space (compiled at -O0, not -O1, so the lines don't get optimized away), each `a = b + c` line becomes machine instructions like `mov rax, [rsp+8]` / `add rax, [rsp+16]` / `mov [rsp+24], rax`, which can in theory run at 3 IPC (2 loads + 1 store) per clock. (With volatile, GCC doesn't like to use memory source operands for ALU insns, so there would probably be a separate `mov` load as well, for maybe 4 IPC if the front-end weren't a bottleneck and store-forwarding latency weren't either.) That `a = b + c` test still got only 1.34 IPC with 70K lines.