
I'm seeing a dramatic instructions-per-cycle (IPC) collapse (2.08 -> 1.30) when increasing the loop body size in simple arithmetic code with essentially no branches, yet the instruction-cache miss rate stays constant at ~9.3%. What CPU frontend bottleneck could cause this?

Results

Lines   IPC    I-Cache Miss Rate
 30K    2.08   9.28%
 70K    1.36   9.27%
100K    1.30   9.27%

Since the I-cache miss rate is identical across runs, this doesn't look like a plain cache problem. What frontend limitation (decode bandwidth, micro-op cache, fetch bandwidth) could explain it?


Setup

CPU: Intel Xeon Gold 6132 @ 2.60GHz (Skylake-SP, 14 cores/socket, 2 sockets)

  • L1i cache: 32K per core
  • L2 cache: 1024K per core
  • L3 cache: 19712K shared

Compiler: gcc -O1

Test Code

Auto-generated C programs with different numbers of lines in the body of a while(1) loop:

#include <stdio.h>
#include <stdint.h>

// Prevent optimization by making variables volatile
volatile uint64_t a = 1, b = 2, c = 3, d = 4;
volatile uint64_t e = 5, f = 6, g = 7, h = 8;
volatile uint64_t counter = 0;

int main() {
    printf("Starting 50000-line arithmetic test\n");
    while(1) {
        b = e + h;
        c = d + e;
        f = g + a;
        b = d + h;
        f = d + d;
        d = a + d;
        e = a + h;
        // ... thousands more lines of similar arithmetic
        c = e + b;
        h = a + b;
        e = c + e;
        f = b + d;
        if ((a + b + c + d + e + f + g + h) == 0) {
            printf("Done\n");
        }
    }
    return 0;
}

Each test contains only straight-line arithmetic; the only branches are the loop back-edge and the final guard if, so branch counts are negligible (confirmed by the perf output below). All variables are volatile so the compiler cannot optimize the arithmetic away.
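For scale, here is a sketch of what gcc -O1 plausibly emits for each generated statement. I haven't pasted the exact disassembly, so the instruction choice and byte counts are estimates:

    # b = e + h;  (volatile globals, so separate loads and a store)
    mov  rax, QWORD PTR e[rip]   # ~7 bytes
    mov  rdx, QWORD PTR h[rip]   # ~7 bytes
    add  rax, rdx                #  3 bytes
    mov  QWORD PTR b[rip], rax   # ~7 bytes

That is roughly 24 bytes and 4 uops per source line of purely sequential code.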

Results

Running perf stat with a 10-second timeout:

Lines   Cycles   IPC    I-Cache Misses   I-Cache Miss Rate
 30K    36.7B    2.08   7.08B            9.28%
 40K    36.8B    1.94   6.61B            9.27%
 50K    36.6B    1.59   5.38B            9.27%
 60K    36.8B    1.44   4.91B            9.27%
 70K    36.8B    1.36   4.65B            9.27%
100K    36.7B    1.30   4.43B            9.27%
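If the ~24 bytes/line estimate from the sketch above is right, the code footprints line up suspiciously well with the cache sizes:

     30K lines x ~24 B ~= 720 KB   (fits in the 1 MiB L2)
     40K lines x ~24 B ~= 960 KB   (right at the L2 boundary)
     70K lines x ~24 B ~= 1.7 MB   (spills to L3)
    100K lines x ~24 B ~= 2.4 MB   (spills to L3)

And at ~6 bytes per instruction, 2.08 IPC corresponds to ~12.5 B/cycle of code fetch, which seems plausible from L2 but presumably not from L3.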

Question

What could be causing this IPC collapse when the loop body grows larger, given that the I-cache miss rate is identical? Is this hitting some frontend bottleneck in the CPU pipeline? What performance counters would help diagnose the actual bottleneck?

Full perf commands and output:

perf stat -e cycles,instructions,branch-instructions,branch-misses,L1-dcache-load-misses,L1-icache-load-misses,LLC-load-misses,dTLB-load-misses,iTLB-load-misses -- timeout 10s ./test_70k
    36,834,726,435      cycles
    50,018,331,051      instructions              #    1.36  insn per cycle
         9,449,823      branch-instructions
           321,923      branch-misses             #    3.41% of all branches
           327,729      L1-dcache-load-misses
     4,638,663,649      L1-icache-load-misses
            20,629      LLC-load-misses
            15,639      dTLB-load-misses
           154,250      iTLB-load-misses

perf stat -e cycles,idq_uops_not_delivered.cycles_fe_was_ok,resource_stalls.any,resource_stalls.sb -- timeout 10s ./test_70k
    36,759,013,093      cycles
     1,489,045,825      idq_uops_not_delivered.cycles_fe_was_ok
        14,521,941      resource_stalls.any
         3,891,491      resource_stalls.sb

      10.001279728 seconds time elapsed
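
For a follow-up run I'm considering these Skylake frontend events to split uop delivery between the uop cache (DSB) and the legacy decoders (MITE); the names are from perf list on this machine, but I haven't validated these runs yet:

    perf stat -e cycles,idq_uops_not_delivered.core,idq.dsb_uops,idq.mite_uops -- timeout 10s ./test_70k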

Here is how I generated the code:

    import random

    operations = ['+']
    variables = ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']

    num_lines = 70000  # 30K ... 100K across the different tests
    lines = []
    for i in range(num_lines):
        # Pick two random source variables and a random destination
        var1 = random.choice(variables)
        var2 = random.choice(variables)
        result = random.choice(variables)
        op = random.choice(operations)
        lines.append(f"{result} = {var1} {op} {var2};")
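
The generator then just pastes those lines into the C skeleton shown above; a minimal sketch of that step (the exact template string isn't reproduced here):

    # Hypothetical: wrap the generated lines in the skeleton shown above.
    # C_TEMPLATE stands in for that program text with a {body} placeholder.
    body = "\n".join("        " + line for line in lines)
    with open(f"test_{num_lines // 1000}k.c", "w") as out:
        out.write(C_TEMPLATE.format(body=body))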
  • So you're intentionally testing with a loop much larger than a normal loop, so it won't fit in the uop cache at all. And everything is volatile in stack space, so the machine instructions will be like mov rax, [rsp+8] / add rax, [rsp+16] / mov [rsp+24], rax for each line, which could in theory run at 3 IPC (2 loads + 1 store per clock). Except that with volatile, GCC doesn't like to use memory source operands for ALU insns, so it probably emits a separate mov load, for maybe 4 IPC if the front-end weren't the bottleneck and store-forwarding latency weren't either. Commented Aug 26 at 3:09
  • I just ran the a = b + c test and still got 1.34 IPC with 70K lines (I used -O0, not -O1, to avoid optimizing those lines away). Commented Aug 26 at 12:36
  • Thanks for taking a look at this, Peter! So my conclusion is the following: I-cache misses don't actually kill the performance here, but fetching instructions from L3 into L2 does. We should keep the hot-path instruction footprint under the L2 cache size. If we want the same work done faster (higher IPC), we have to shrink the instruction count to keep things cache-friendly. Forced-noinline functions or forced rolled loops are really our only options; there's no magic bullet for this kind of workload other than making it fit better in cache. Commented Aug 26 at 19:38
  • Yeah, a >1MiB loop is ridiculous. Aim for loops to fit in the uop cache, at most 1536 uops on Skylake. Or at worst in L2 cache, since bandwidth from there is only a bit worse (13.5 B/cycle) than from L1i (16 B/cycle). But most loops also need data hot in L2 cache. If instruction lengths were short enough, like if all your variables were in registers, code-fetch bandwidth would be less of a bottleneck; e.g. 4 IPC is possible with 4-byte instructions at 16 bytes per cycle. (Skylake's pipeline is only 4 uops wide; Ice Lake and Zen widen to 5, Alder Lake and Zen 3 widen to 6.) Commented Aug 26 at 20:30
  • Unrolling tiny loops is often good if they're long-running, but yeah, loop size is ideally under 1536 uops (or less, since packing is almost never perfect, with 32-byte machine-code boundaries ending uop-cache lines). Commented Aug 26 at 20:32
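
Following up on the comments, here is a minimal sketch of the cache-friendly alternative I plan to measure: the same kind of volatile arithmetic expressed as a small rolled loop that should fit in the uop cache (the iteration count and statement mix are arbitrary placeholders):

    #include <stdint.h>
    #include <stdio.h>

    volatile uint64_t a = 1, b = 2, c = 3, d = 4;
    volatile uint64_t e = 5, f = 6, g = 7, h = 8;

    int main(void) {
        while (1) {
            /* Hot code is now a handful of instructions instead of ~1.7 MB,
               so it should come from the uop cache (<= 1536 uops on Skylake). */
            for (int i = 0; i < 70000; i++) {
                b = e + h;
                c = d + e;
            }
            if ((a + b + c + d + e + f + g + h) == 0) {
                printf("Done\n");  /* never true; same guard as the generated code */
            }
        }
        return 0;
    }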
