
I have a simple math library that gets linked into a project running on simulator hardware (a 32-bit RTOS); the compiler toolchain is based on a variant of GCC 5.5. The main project code is in MATLAB, but the core math operations (cmath functions applied to array data) are rewritten in C for performance. Looking at Compiler Explorer, the optimised code from GCC 5.5 targeting 32-bit does not look great (for reference: Clang trunk, 32-bit). From what I understand, Clang does a better job of optimising these loops. One example code snippet:

...

void cfunctionsLog10(unsigned int n, const double* x, double* y) {
    int i;
    for (i = 0; i < n; i++) {
        y[i] = log10(x[i]);
    }
}

And the corresponding assembly generated by GCC 5.5:

cfunctionsLog10(unsigned int, double const*, double*):
        push    ebp
        push    edi
        push    esi
        push    ebx
        sub     esp, 12
        mov     esi, DWORD PTR [esp+32]
        mov     ebp, DWORD PTR [esp+36]
        mov     edi, DWORD PTR [esp+40]
        test    esi, esi
        je      .L28
        xor     ebx, ebx
.L27:
        sub     esp, 8
        push    DWORD PTR [ebp+4+ebx*8]
        push    DWORD PTR [ebp+0+ebx*8]
        call    __log10_finite
        fstp    QWORD PTR [edi+ebx*8]
        add     ebx, 1
        add     esp, 16
        cmp     ebx, esi
        jne     .L27
.L28:
        add     esp, 12
        pop     ebx
        pop     esi
        pop     edi
        pop     ebp
        ret

where Clang produces:

cfunctionsLog10(unsigned int, double const*, double*):              # @cfunctionsLog10(unsigned int, double const*, double*)
        push    ebp
        push    ebx
        push    edi
        push    esi
        sub     esp, 76
        mov     esi, dword ptr [esp + 96]
        test    esi, esi
        je      .LBB2_8
        mov     edi, dword ptr [esp + 104]
        mov     ebx, dword ptr [esp + 100]
        xor     ebp, ebp
        cmp     esi, 4
        jb      .LBB2_7
        lea     eax, [ebx + 8*esi]
        cmp     eax, edi
        jbe     .LBB2_4
        lea     eax, [edi + 8*esi]
        cmp     eax, ebx
        ja      .LBB2_7
.LBB2_4:
        mov     ebp, esi
        xor     esi, esi
        and     ebp, -4
.LBB2_5:                                # =>This Inner Loop Header: Depth=1
        vmovsd  xmm0, qword ptr [ebx + 8*esi + 16] # xmm0 = mem[0],zero
        vmovsd  qword ptr [esp], xmm0
        vmovsd  xmm0, qword ptr [ebx + 8*esi] # xmm0 = mem[0],zero
        vmovsd  xmm1, qword ptr [ebx + 8*esi + 8] # xmm1 = mem[0],zero
        vmovsd  qword ptr [esp + 8], xmm0 # 8-byte Spill
        vmovsd  qword ptr [esp + 16], xmm1 # 8-byte Spill
        call    log10
        fstp    tbyte ptr [esp + 64]    # 10-byte Folded Spill
        vmovsd  xmm0, qword ptr [esp + 16] # 8-byte Reload
        vmovsd  qword ptr [esp], xmm0
        call    log10
        fstp    tbyte ptr [esp + 16]    # 10-byte Folded Spill
        vmovsd  xmm0, qword ptr [esp + 8] # 8-byte Reload
        vmovsd  qword ptr [esp], xmm0
        vmovsd  xmm0, qword ptr [ebx + 8*esi + 24] # xmm0 = mem[0],zero
        vmovsd  qword ptr [esp + 8], xmm0 # 8-byte Spill
        call    log10
        vmovsd  xmm0, qword ptr [esp + 8] # 8-byte Reload
        vmovsd  qword ptr [esp], xmm0
        fstp    qword ptr [esp + 56]
        fld     tbyte ptr [esp + 16]    # 10-byte Folded Reload
        fstp    qword ptr [esp + 48]
        fld     tbyte ptr [esp + 64]    # 10-byte Folded Reload
        fstp    qword ptr [esp + 40]
        call    log10
        fstp    qword ptr [esp + 32]
        vmovsd  xmm0, qword ptr [esp + 56] # xmm0 = mem[0],zero
        vmovsd  xmm1, qword ptr [esp + 40] # xmm1 = mem[0],zero
        vmovhps xmm0, xmm0, qword ptr [esp + 48] # xmm0 = xmm0[0,1],mem[0,1]
        vmovhps xmm1, xmm1, qword ptr [esp + 32] # xmm1 = xmm1[0,1],mem[0,1]
        vmovups xmmword ptr [edi + 8*esi + 16], xmm1
        vmovups xmmword ptr [edi + 8*esi], xmm0
        add     esi, 4
        cmp     ebp, esi
        jne     .LBB2_5
        mov     esi, dword ptr [esp + 96]
        cmp     ebp, esi
        je      .LBB2_8
.LBB2_7:                                # =>This Inner Loop Header: Depth=1
        vmovsd  xmm0, qword ptr [ebx + 8*ebp] # xmm0 = mem[0],zero
        vmovsd  qword ptr [esp], xmm0
        call    log10
        fstp    qword ptr [edi + 8*ebp]
        inc     ebp
        cmp     esi, ebp
        jne     .LBB2_7
.LBB2_8:
        add     esp, 76
        pop     esi
        pop     edi
        pop     ebx
        pop     ebp
        ret

As I am unable to use Clang directly, is there any value in rewriting the C source using AVX intrinsics? I think the majority of the performance cost comes from the cmath function calls, most of which do not have intrinsic implementations.
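For an operation that does map directly to a hardware instruction, e.g. sqrt via _mm256_sqrt_pd, the kind of rewrite I have in mind looks roughly like the sketch below (cfunctionsSqrt is purely illustrative and has to be built with AVX enabled); log10, tanh, etc. have no such instruction, so they would need a vectorised math routine instead.

#include <immintrin.h>
#include <math.h>

/* Illustrative sketch only: sqrt has a direct AVX instruction, unlike log10. */
void cfunctionsSqrt(unsigned int n, const double* x, double* y) {
    unsigned int i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256d v = _mm256_loadu_pd(x + i);          /* load 4 doubles */
        _mm256_storeu_pd(y + i, _mm256_sqrt_pd(v));  /* 4 square roots in one instruction */
    }
    for (; i < n; ++i)                               /* scalar tail */
        y[i] = sqrt(x[i]);
}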


EDIT: Reimplemented using the vectorclass library:

#include <cmath>
#include "vectorclass.h"
#include "vectormath_hyp.h"   // vectorised tanh() for Vec4d

void vclfunctionsTanh(unsigned int n, const double* x, double* y)
{
    const int N = n;
    const int VectorSize = 4;
    const int FirstPass = N & (-VectorSize);   // largest multiple of 4 <= N

    int i = 0;
    for (; i < FirstPass; i += VectorSize)
    {
        Vec4d data;
        data.load(x + i);         // load 4 doubles
        Vec4d ans = tanh(data);   // vectorised tanh from the library
        ans.store(y + i);         // store 4 results
    }

    // scalar tail for the remaining 0-3 elements
    for (; i < N; ++i)
        y[i] = std::tanh(x[i]);
}
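Note: for Vec4d to actually use 256-bit YMM registers this presumably has to be compiled with AVX enabled (e.g. -mavx2 -mfma on GCC); without it the library falls back to processing each Vec4d as two 128-bit halves.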

  • First of all, include some actual code + asm you're talking about in your question, not just via Godbolt shortlinks. (And if you're not including everything, use "full" links to Godbolt to guard against bitrot.) Also, use a newer GCC for more efficient auto-vectorization of sqrt, and/or -mno-avx256-split-unaligned-load -mno-avx256-split-unaligned-store (see Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?). Also maybe try -mfpmath=sse? The calling convention still passes arguments on the stack and returns in x87 though, unless you recompile libm. Commented Jun 11, 2020 at 10:37
  • If you just need fast approximations of some of those functions, you can manually vectorize them and inline them with intrinsics; that could give a major speedup. Despite the inefficient mixing of AVX and x87 doing store/reload, most of the cost will be inside the math library functions, except for sqrt. So yes, doing 4 doubles in parallel could be at least 4x faster, or more if you trade some precision for speed. (And/or skip NaN checking; if you know your inputs are finite, that can help a lot, especially for SIMD where you can't branch per element.) Commented Jun 11, 2020 at 10:42
  • Yes, depending on how you write them, they certainly can. log and exp are relatively easy to vectorize, and a polynomial approximation for the mantissa converges quickly. e.g. Efficient implementation of log2(__m256d) in AVX2 / Fastest Implementation of Exponential Function Using AVX and related Q&As. IDK about cos. For float, you can simply test every possible 32-bit float bit-pattern to be certain about min/max error. Commented Jun 11, 2020 at 11:45
  • Also there are SIMD math libraries with existing implementations, like Where is Clang's '_mm256_pow_ps' intrinsic? / AVX log intrinsics (_mm256_log_ps) missing in g++-4.8? / Mathematical functions for SIMD registers Commented Jun 11, 2020 at 11:46
  • Your C cfunctionsTanh is a different function from the asm you show, which is for cfunctionsLog10. Also, clang's asm is actually shockingly bad, doing a tbyte spill/reload of the return value to the stack instead of converting straight into the destination. A 10-byte x87 store/reload is much slower than double, and it's totally unnecessary. Commented Jun 11, 2020 at 17:30

1 Answer


The vector class library has inline vector versions of common mathematical functions, including log10.

https://github.com/vectorclass/
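For example, the cfunctionsLog10 loop from the question could look like this with the library (a minimal, untested sketch; it assumes the vectorclass.h and vectormath_exp.h headers, which provide Vec4d and its log10 overload, and the name vclfunctionsLog10 is just illustrative):

#include <cmath>
#include "vectorclass.h"
#include "vectormath_exp.h"   // log10() for Vec4d

void vclfunctionsLog10(unsigned int n, const double* x, double* y)
{
    const int N = n;
    const int FirstPass = N & (-4);      // largest multiple of 4 <= N
    int i = 0;
    for (; i < FirstPass; i += 4)
    {
        Vec4d data;
        data.load(x + i);                // load 4 doubles
        Vec4d ans = log10(data);         // inline vectorised log10
        ans.store(y + i);                // store 4 results
    }
    for (; i < N; ++i)                   // scalar tail
        y[i] = std::log10(x[i]);
}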

  • The documentation recommends not compiling with -ffast-math; is this true for cases where we know that we deal only with finite numbers and operations that don't overflow? Commented Jun 20, 2020 at 11:49
  • -ffast-math makes it impossible to detect NaN results in most cases. If you do not want to check for NaNs anyway, then you can use -ffast-math. Commented Jun 21, 2020 at 14:58
