
I'm seeing a dramatic instructions-per-cycle (IPC) collapse (2.08 -> 1.30) when increasing the loop body size in simple arithmetic code with essentially no branches, yet the instruction-cache miss rate stays constant at ~9.3%. What CPU frontend bottleneck could cause this?

Results

Lines   IPC    I-Cache Miss Rate
 30K    2.08   9.28%
 70K    1.36   9.27%
100K    1.30   9.27%

Since the I-cache miss rate is identical across runs, this doesn't look like a plain cache problem. What frontend limitation (decode bandwidth, micro-op cache, fetch bandwidth) could explain it?


Setup

CPU: Intel Xeon Gold 6132 @ 2.60GHz (Skylake-SP, 14 cores/socket, 2 sockets)

  • L1i cache: 32K per core
  • L2 cache: 1024K per core
  • L3 cache: 19712K shared

Compiler: gcc -O1

Test Code

Auto-generated C programs with different numbers of lines in the body of a while(1) loop:

#include <stdio.h>
#include <stdint.h>

// Prevent optimization by making variables volatile
volatile uint64_t a = 1, b = 2, c = 3, d = 4;
volatile uint64_t e = 5, f = 6, g = 7, h = 8;
volatile uint64_t counter = 0;

int main() {
    printf("Starting 50000-line arithmetic test\n");
    while(1) {
        b = e + h;
        c = d + e;
        f = g + a;
        b = d + h;
        f = d + d;
        d = a + d;
        e = a + h;
        // ... thousands more lines of similar arithmetic
        c = e + b;
        h = a + b;
        e = c + e;
        f = b + d;
        if ((a + b + c + d + e + f + g + h) == 0) {
            printf("Done\n");
        }
    }
    return 0;
}

Each test contains only straight-line arithmetic; the only branches are the loop back-edge and the final guard if, so branch counts are negligible (confirmed by the perf output below). All variables are volatile so the compiler cannot optimize the arithmetic away.
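For scale, here is a sketch of what gcc -O1 plausibly emits for each generated statement. I haven't pasted the exact disassembly, so the instruction choice and byte counts are estimates:

    # b = e + h;  (volatile globals, so separate loads and a store)
    mov  rax, QWORD PTR e[rip]   # ~7 bytes
    mov  rdx, QWORD PTR h[rip]   # ~7 bytes
    add  rax, rdx                #  3 bytes
    mov  QWORD PTR b[rip], rax   # ~7 bytes

That is roughly 24 bytes and 4 uops per source line of purely sequential code.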

Results

Running perf stat with a 10-second timeout:

Lines   Cycles   IPC    I-Cache Misses   I-Cache Miss Rate
 30K    36.7B    2.08   7.08B            9.28%
 40K    36.8B    1.94   6.61B            9.27%
 50K    36.6B    1.59   5.38B            9.27%
 60K    36.8B    1.44   4.91B            9.27%
 70K    36.8B    1.36   4.65B            9.27%
100K    36.7B    1.30   4.43B            9.27%
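If the ~24 bytes/line estimate from the sketch above is right, the code footprints line up suspiciously well with the cache sizes:

     30K lines x ~24 B ~= 720 KB   (fits in the 1 MiB L2)
     40K lines x ~24 B ~= 960 KB   (right at the L2 boundary)
     70K lines x ~24 B ~= 1.7 MB   (spills to L3)
    100K lines x ~24 B ~= 2.4 MB   (spills to L3)

And at ~6 bytes per instruction, 2.08 IPC corresponds to ~12.5 B/cycle of code fetch, which seems plausible from L2 but presumably not from L3.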

Question

What could be causing this IPC collapse when the loop body grows larger, given that the I-cache miss rate is identical? Is this hitting some frontend bottleneck in the CPU pipeline? What performance counters would help diagnose the actual bottleneck?

Full perf commands and output:

perf stat -e cycles,instructions,branch-instructions,branch-misses,L1-dcache-load-misses,L1-icache-load-misses,LLC-load-misses,dTLB-load-misses,iTLB-load-misses -- timeout 10s ./test_70k
    36,834,726,435      cycles
    50,018,331,051      instructions              #    1.36  insn per cycle
         9,449,823      branch-instructions
           321,923      branch-misses             #    3.41% of all branches
           327,729      L1-dcache-load-misses
     4,638,663,649      L1-icache-load-misses
            20,629      LLC-load-misses
            15,639      dTLB-load-misses
           154,250      iTLB-load-misses

perf stat -e cycles,idq_uops_not_delivered.cycles_fe_was_ok,resource_stalls.any,resource_stalls.sb -- timeout 10s ./test_70k
    36,759,013,093      cycles
     1,489,045,825      idq_uops_not_delivered.cycles_fe_was_ok
        14,521,941      resource_stalls.any
         3,891,491      resource_stalls.sb

      10.001279728 seconds time elapsed
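
For a follow-up run I'm considering these Skylake frontend events to split uop delivery between the uop cache (DSB) and the legacy decoders (MITE); the names are from perf list on this machine, but I haven't validated these runs yet:

    perf stat -e cycles,idq_uops_not_delivered.core,idq.dsb_uops,idq.mite_uops -- timeout 10s ./test_70k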

Here is how I generated the code:

    import random

    operations = ['+']
    variables = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

    num_lines = 70000  # 30K ... 100K across the different tests
    lines = []
    for i in range(num_lines):
        # Pick two random source variables and a random destination
        var1 = random.choice(variables)
        var2 = random.choice(variables)
        result = random.choice(variables)
        op = random.choice(operations)
        lines.append(f"{result} = {var1} {op} {var2};")
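
The generator then just pastes those lines into the C skeleton shown above; a minimal sketch of that step (the exact template string isn't reproduced here):

    # Hypothetical: wrap the generated lines in the skeleton shown above.
    # C_TEMPLATE stands in for that program text with a {body} placeholder.
    body = "\n".join("        " + line for line in lines)
    with open(f"test_{num_lines // 1000}k.c", "w") as out:
        out.write(C_TEMPLATE.format(body=body))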
  • So you're intentionally testing with a loop much larger than a normal loop, so it won't fit in the uop cache at all. And everything is volatile in stack space, so the machine instructions will be like mov rax, [rsp+8] / add rax, [rsp+16] / mov [rsp+24], rax for each line, which could in theory run at 3 IPC (2 loads + 1 store per clock). Except that with volatile, GCC doesn't like to use memory source operands for ALU insns, so it probably emits a separate mov load, for maybe 4 IPC if the front-end weren't the bottleneck and store-forwarding latency weren't either. Commented Aug 26 at 3:09
  • I just ran the a = b + c test and still got 1.34 IPC with 70K lines (I used -O0, not -O1, to avoid optimizing those lines away). Commented Aug 26 at 12:36
  • Thanks for taking a look at this, Peter! So my conclusion is the following: I-cache misses don't actually kill the performance here, but fetching instructions from L3 into L2 does. We should keep the hot-path instruction footprint under the L2 cache size. If we want the same work done faster (higher IPC), we have to shrink the instruction count to keep things cache-friendly. Forced-noinline functions or forced rolled loops are really our only options; there's no magic bullet for this kind of workload other than making it fit better in cache. Commented Aug 26 at 19:38
  • Yeah, a >1MiB loop is ridiculous. Aim for loops to fit in the uop cache, at most 1536 uops on Skylake. Or at worst in L2 cache, since bandwidth from there is only a bit worse (13.5 B/cycle) than from L1i (16 B/cycle). But most loops also need data hot in L2 cache. If instruction lengths were short enough, like if all your variables were in registers, code-fetch bandwidth would be less of a bottleneck; e.g. 4 IPC is possible with 4-byte instructions at 16 bytes per cycle. (Skylake's pipeline is only 4 uops wide; Ice Lake and Zen widen to 5, Alder Lake and Zen 3 widen to 6.) Commented Aug 26 at 20:30
  • Unrolling tiny loops is often good if they're long-running, but yeah, loop size is ideally under 1536 uops (or less, since packing is almost never perfect, with 32-byte machine-code boundaries ending uop-cache lines). Commented Aug 26 at 20:32
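
Following up on the comments, here is a minimal sketch of the cache-friendly alternative I plan to measure: the same kind of volatile arithmetic expressed as a small rolled loop that should fit in the uop cache (the iteration count and statement mix are arbitrary placeholders):

    #include <stdint.h>
    #include <stdio.h>

    volatile uint64_t a = 1, b = 2, c = 3, d = 4;
    volatile uint64_t e = 5, f = 6, g = 7, h = 8;

    int main(void) {
        while (1) {
            /* Hot code is now a handful of instructions instead of ~1.7 MB,
               so it should come from the uop cache (<= 1536 uops on Skylake). */
            for (int i = 0; i < 70000; i++) {
                b = e + h;
                c = d + e;
            }
            if ((a + b + c + d + e + f + g + h) == 0) {
                printf("Done\n");  /* never true; same guard as the generated code */
            }
        }
        return 0;
    }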
