
I have a simple math library that gets linked into a project running on simulator hardware (a 32-bit RTOS); the compiler toolchain is based on a variant of GCC 5.5. The main project code is in MATLAB, but the core math operations (cmath functions applied to array data) are rewritten in C for performance. Looking at Compiler Explorer, the optimised code from GCC 5.5 targeting 32-bit does not look great (for reference: Clang trunk, 32-bit). From what I understand, Clang does a better job of optimising these loops. One example code snippet:

...

void cfunctionsLog10(unsigned int n, const double* x, double* y) {
    int i;
    for (i = 0; i < n; i++) {
        y[i] = log10(x[i]);
    }
}

And the corresponding assembly generated by GCC 5.5:

cfunctionsLog10(unsigned int, double const*, double*):
        push    ebp
        push    edi
        push    esi
        push    ebx
        sub     esp, 12
        mov     esi, DWORD PTR [esp+32]
        mov     ebp, DWORD PTR [esp+36]
        mov     edi, DWORD PTR [esp+40]
        test    esi, esi
        je      .L28
        xor     ebx, ebx
.L27:
        sub     esp, 8
        push    DWORD PTR [ebp+4+ebx*8]
        push    DWORD PTR [ebp+0+ebx*8]
        call    __log10_finite
        fstp    QWORD PTR [edi+ebx*8]
        add     ebx, 1
        add     esp, 16
        cmp     ebx, esi
        jne     .L27
.L28:
        add     esp, 12
        pop     ebx
        pop     esi
        pop     edi
        pop     ebp
        ret

where Clang produces:

cfunctionsLog10(unsigned int, double const*, double*):              # @cfunctionsLog10(unsigned int, double const*, double*)
        push    ebp
        push    ebx
        push    edi
        push    esi
        sub     esp, 76
        mov     esi, dword ptr [esp + 96]
        test    esi, esi
        je      .LBB2_8
        mov     edi, dword ptr [esp + 104]
        mov     ebx, dword ptr [esp + 100]
        xor     ebp, ebp
        cmp     esi, 4
        jb      .LBB2_7
        lea     eax, [ebx + 8*esi]
        cmp     eax, edi
        jbe     .LBB2_4
        lea     eax, [edi + 8*esi]
        cmp     eax, ebx
        ja      .LBB2_7
.LBB2_4:
        mov     ebp, esi
        xor     esi, esi
        and     ebp, -4
.LBB2_5:                                # =>This Inner Loop Header: Depth=1
        vmovsd  xmm0, qword ptr [ebx + 8*esi + 16] # xmm0 = mem[0],zero
        vmovsd  qword ptr [esp], xmm0
        vmovsd  xmm0, qword ptr [ebx + 8*esi] # xmm0 = mem[0],zero
        vmovsd  xmm1, qword ptr [ebx + 8*esi + 8] # xmm1 = mem[0],zero
        vmovsd  qword ptr [esp + 8], xmm0 # 8-byte Spill
        vmovsd  qword ptr [esp + 16], xmm1 # 8-byte Spill
        call    log10
        fstp    tbyte ptr [esp + 64]    # 10-byte Folded Spill
        vmovsd  xmm0, qword ptr [esp + 16] # 8-byte Reload
        vmovsd  qword ptr [esp], xmm0
        call    log10
        fstp    tbyte ptr [esp + 16]    # 10-byte Folded Spill
        vmovsd  xmm0, qword ptr [esp + 8] # 8-byte Reload
        vmovsd  qword ptr [esp], xmm0
        vmovsd  xmm0, qword ptr [ebx + 8*esi + 24] # xmm0 = mem[0],zero
        vmovsd  qword ptr [esp + 8], xmm0 # 8-byte Spill
        call    log10
        vmovsd  xmm0, qword ptr [esp + 8] # 8-byte Reload
        vmovsd  qword ptr [esp], xmm0
        fstp    qword ptr [esp + 56]
        fld     tbyte ptr [esp + 16]    # 10-byte Folded Reload
        fstp    qword ptr [esp + 48]
        fld     tbyte ptr [esp + 64]    # 10-byte Folded Reload
        fstp    qword ptr [esp + 40]
        call    log10
        fstp    qword ptr [esp + 32]
        vmovsd  xmm0, qword ptr [esp + 56] # xmm0 = mem[0],zero
        vmovsd  xmm1, qword ptr [esp + 40] # xmm1 = mem[0],zero
        vmovhps xmm0, xmm0, qword ptr [esp + 48] # xmm0 = xmm0[0,1],mem[0,1]
        vmovhps xmm1, xmm1, qword ptr [esp + 32] # xmm1 = xmm1[0,1],mem[0,1]
        vmovups xmmword ptr [edi + 8*esi + 16], xmm1
        vmovups xmmword ptr [edi + 8*esi], xmm0
        add     esi, 4
        cmp     ebp, esi
        jne     .LBB2_5
        mov     esi, dword ptr [esp + 96]
        cmp     ebp, esi
        je      .LBB2_8
.LBB2_7:                                # =>This Inner Loop Header: Depth=1
        vmovsd  xmm0, qword ptr [ebx + 8*ebp] # xmm0 = mem[0],zero
        vmovsd  qword ptr [esp], xmm0
        call    log10
        fstp    qword ptr [edi + 8*ebp]
        inc     ebp
        cmp     esi, ebp
        jne     .LBB2_7
.LBB2_8:
        add     esp, 76
        pop     esi
        pop     edi
        pop     ebx
        pop     ebp
        ret

As I am unable to use Clang directly, is there any value in rewriting the C source using AVX intrinsics? I think the majority of the performance cost comes from the cmath function calls, most of which do not have intrinsic implementations.
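For an operation that does map directly to a hardware instruction, e.g. sqrt via _mm256_sqrt_pd, the kind of rewrite I have in mind looks roughly like the sketch below (cfunctionsSqrt is purely illustrative and has to be built with AVX enabled); log10, tanh, etc. have no such instruction, so they would need a vectorised math routine instead.

#include <immintrin.h>
#include <math.h>

/* Illustrative sketch only: sqrt has a direct AVX instruction, unlike log10. */
void cfunctionsSqrt(unsigned int n, const double* x, double* y) {
    unsigned int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(x + i);          /* load 4 doubles */
        _mm256_storeu_pd(y + i, _mm256_sqrt_pd(v));  /* 4 square roots in one instruction */
    }
    for (; i < n; ++i)                               /* scalar tail */
        y[i] = sqrt(x[i]);
}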


EDIT: Reimplemented using the vectorclass library:

#include <cmath>
#include "vectorclass.h"
#include "vectormath_hyp.h"   // vectorised tanh() for Vec4d

void vclfunctionsTanh(unsigned int n, const double* x, double* y)
{
    const int N = n;
    const int VectorSize = 4;
    const int FirstPass = N & (-VectorSize);   // largest multiple of 4 <= N

    int i = 0;
    for (; i < FirstPass; i += VectorSize)
    {
        Vec4d data;
        data.load(x + i);         // load 4 doubles
        Vec4d ans = tanh(data);   // vectorised tanh from the library
        ans.store(y + i);         // store 4 results
    }

    // scalar tail for the remaining 0-3 elements
    for (; i < N; ++i)
        y[i] = std::tanh(x[i]);
}
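Note: for Vec4d to actually use 256-bit YMM registers this presumably has to be compiled with AVX enabled (e.g. -mavx2 -mfma on GCC); without it the library falls back to processing each Vec4d as two 128-bit halves.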

  • First of all, include some actual code + asm you're talking about in your question, not just via Godbolt shortlinks. (And if you're not including everything, use "full" links to Godbolt to guard against bitrot.) Also, use a newer GCC for more efficient auto-vectorization of sqrt, and/or -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store (see Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?). Also maybe try -mfpmath=sse? The calling convention still passes arguments on the stack and returns in x87 though, unless you recompile libm. Commented Jun 11, 2020 at 10:37
  • If you just need fast approximations of some of those functions, you can manually vectorize them and inline them with intrinsics; that could give a major speedup. Despite the inefficient mixing of AVX and x87 doing store/reload, most of the cost will be inside the math library functions, except for sqrt. So yes, doing 4 doubles in parallel could be at least 4x faster, or more if you trade some precision for speed. (And/or skip NaN checking; if you know your inputs are finite, that can help a lot, especially for SIMD where you can't branch per element.) Commented Jun 11, 2020 at 10:42
  • Yes, depending on how you write them, they certainly can. log and exp are relatively easy to vectorize, and a polynomial approximation for the mantissa converges quickly. e.g. Efficient implementation of log2(__m256d) in AVX2 / Fastest Implementation of Exponential Function Using AVX and related Q&As. IDK about cos. For float, you can simply test every possible 32-bit float bit-pattern to be certain about min/max error. Commented Jun 11, 2020 at 11:45
  • Also there are SIMD math libraries with existing implementations, like Where is Clang's '_mm256_pow_ps' intrinsic? / AVX log intrinsics (_mm256_log_ps) missing in g++-4.8? / Mathematical functions for SIMD registers Commented Jun 11, 2020 at 11:46
  • Your C cfunctionsTanh is a different function from the asm you show, which is for cfunctionsLog10. Also, clang's asm is actually shockingly bad, doing a tbyte spill/reload of the return value to the stack instead of converting straight into the destination. A 10-byte x87 store/reload is much slower than double, and it's totally unnecessary. Commented Jun 11, 2020 at 17:30

1 Answer


The vector class library has inline vector versions of common mathematical functions, including log10.

https://github.com/vectorclass/
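For example, the cfunctionsLog10 loop from the question could look like this with the library (a minimal, untested sketch; it assumes the vectorclass.h and vectormath_exp.h headers, which provide Vec4d and its log10 overload, and the name vclfunctionsLog10 is just illustrative):

#include <cmath>
#include "vectorclass.h"
#include "vectormath_exp.h"   // log10() for Vec4d

void vclfunctionsLog10(unsigned int n, const double* x, double* y)
{
    const int N = n;
    const int FirstPass = N & (-4);      // largest multiple of 4 <= N
    int i = 0;
    for (; i < FirstPass; i += 4)
    {
        Vec4d data;
        data.load(x + i);                // load 4 doubles
        Vec4d ans = log10(data);         // inline vectorised log10
        ans.store(y + i);                // store 4 results
    }
    for (; i < N; ++i)                   // scalar tail
        y[i] = std::log10(x[i]);
}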

  • The documentation recommends not compiling with -ffast-math; is this true for cases where we know that we deal only with finite numbers and operations that don't overflow? Commented Jun 20, 2020 at 11:49
  • -ffast-math makes it impossible to detect NaN results in most cases. If you do not want to check for NaNs anyway, then you can use -ffast-math. Commented Jun 21, 2020 at 14:58
