Newest 'micro-optimization' Questions

4 votes

0 answers

110 views

"Repacking" 64 to 40 bits in AVX2

I have an array of 64-bit words, which I need to compact by removing the most significant 24 bits of each 64-bit word. Essentially this code in pure C: void repack1(uint8_t* out, uint64_t* in) { ...

swineone

2,960

asked Sep 2 at 2:05
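
For the repacking question above, a scalar reference model can help pin down the expected byte layout before vectorizing. The sketch below is an assumption, not the asker's elided `repack1`: it keeps the low 40 bits of each word and emits them as 5 bytes, and both the little-endian byte order and the name `repack_reference` are hypothetical choices made for illustration.

```python
MASK40 = (1 << 40) - 1  # keeps the low 40 bits, drops the most significant 24


def repack_reference(words):
    """Pack the low 40 bits of each 64-bit word into 5 bytes.

    Assumption: bytes are emitted least-significant first (little-endian).
    """
    out = bytearray()
    for w in words:
        out += (w & MASK40).to_bytes(5, "little")
    return bytes(out)
```

A model like this is mainly useful as an oracle: run it and a vectorized version on the same random input and compare the output buffers byte for byte.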

3 votes

0 answers

63 views

CUDA: Load misaligned float4 vector

I want to load 4 floats per thread. I know they are not 16 byte aligned. What is the best way to do this? Specific conditions: I cannot align the array without replicating the data because other ...

Homer512

14.9k

asked Aug 22 at 12:53

0 votes

1 answer

135 views

Why is `JArray.ToObject<List<T>>` faster than `JArray.ToObject<T[]>`

We are trying to micro-optimise some parts of our production code, and I was expecting that using arrays would generally be better than lists in .NET. Most of the time, my benchmarks indicate that....

FluidMechanics Potential Flows

666

asked Aug 2 at 11:10

29 votes

1 answer

4k views

Why do C compilers still prefer push over mov for saving registers, even when mov appears faster in llvm-mca?

I noticed that modern C compilers typically use push instructions to save caller-saved registers, rather than explicit mov + sub sequences. However, based on llvm-mca simulations, the mov approach ...

Moi5t

465

asked Jul 29 at 4:12

14 votes

3 answers

988 views

Efficient extraction of first/only key in a dictionary

Assumption is that we have a dictionary containing exactly one key/value pair. Objective is to extract the only key. I can think of four ways to do this (there may be more). import timeit def func1(d):...

Ramrab

28.7k

asked Jul 27 at 9:15
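
To make the dictionary question above concrete, here is a minimal sketch of some common ways to pull the only key out of a one-item dict. These are not necessarily the four functions from the truncated question; the function names are hypothetical and chosen only for this illustration.

```python
def key_via_iter(d):
    # Take the first item from the dict's key iterator.
    return next(iter(d))


def key_via_unpack(d):
    # Tuple unpacking; raises ValueError unless d has exactly one key.
    (key,) = d
    return key


def key_via_list(d):
    # Materializes all keys in a list first, then indexes it.
    return list(d)[0]


def key_via_loop(d):
    # Return on the first iteration of a plain for loop.
    for key in d:
        return key
    raise KeyError("empty dictionary")
```

The unpacking form doubles as a check that the dictionary really holds exactly one key. Relative speed is best measured with `timeit` on the target interpreter rather than assumed.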

1 vote

2 answers

127 views

68000 assembly – two-pass routine to find the average of A and count elements above it

Background: For practice I wrote a 68000 sub-routine that computes the integer average of an n-element signed-word array A, and counts how many elements of A are strictly greater than that average. ...

Pato

1

asked Jul 10 at 15:54

2 votes

1 answer

122 views

68000 Assembly – one-pass swap-and-sum of two word vectors (can it be done better?)

Task: I’m practising fixed-size loops on a Motorola 68000. Given two 10-word signed arrays A and B, I need to swap each pair (A[i] ↔ B[i]) in place, and build a third array C where C[i] = A[i] + B[i] (...

Pato

1

asked Jul 3 at 21:49

0 votes

1 answer

154 views

68000 Assembly – Is branchless code faster for counting signed compare conditions?

Context: I have a fixed dataset of 10 signed words in A and B, plus two signed thresholds a and b. In one pass I need to: store C[i] = A[i] + B[i]; accumulate CONT1 += (A[i] > b); accumulate CONT2 += (...

Pato

1

asked Jul 1 at 13:09

2 votes

2 answers

94 views

68000 Assembly – Reverse Array A into B via Stack Parameters

Task: Develop a Motorola 68000 program that reverses the order of elements in an input array A and copies the result into a second array B. All logic must live in one subroutine (or more) with parameters ...

Pato

1

asked Jun 28 at 20:46

0 votes

0 answers

86 views

AVX2 cross-lane shuffles

I'm writing some AVX2 code that is very permutation-heavy. The main permutation instructions used are unpacks, VPSHUFB, some uses of VPERM2I128 and a few of VPBLENDW. After puzzling over the ...

swineone

2,960

asked Jun 26 at 3:55

4 votes

1 answer

138 views

68000 Assembly – Build a String from Characters not Present in Another & Return Its Length (stack-passed params)

Task: Develop a program for the Motorola 68000 that creates a string C containing every character from A that does not appear anywhere in B, and computes the length of this new string C. All logic must ...

Pato

1

asked Jun 24 at 13:34

2 votes

2 answers

144 views

Fast combined element-wise minimum/maximum for 64-bit signed integers in AVX2

I need to compute the element-wise minima and maxima of two arrays of 64-bit signed integers, using AVX2. Target is Golden Cove/Raptor Cove (Intel 12th/13th generation P-core). AVX2 has minimum and ...

swineone

2,960

asked Jun 16 at 2:20

17 votes

7 answers

684 views

Efficient AVX2 implementation of a 17x17-bit squaring operation with result truncation

I am trying to create an efficient AVX2 implementation for a 17x17-bit truncated squarer that returns the 15 most significant bits of the result. This operation appears as a building block in ...

njuffa

26.8k

asked May 31 at 23:52

9 votes

3 answers

324 views

3D Morton code computation utilizing carry-less multiplication

This question arose specifically in the context of exploratory work for RISC-V platforms which may optionally support a carry-less multiplication instruction CLMUL. It would equally apply to other ...

njuffa

26.8k

asked Apr 28 at 4:09

5 votes

3 answers

329 views

efficient check whether unsigned integer value belongs to either of two compile-time constant intervals

In various contexts I have faced the issue of determining whether a given unsigned integer value belongs to either one of two non-overlapping intervals, and not infrequently these checks introduce ...

njuffa

26.8k

asked Apr 21 at 22:25
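
As background for the interval question above, a single range check on unsigned integers is often written as one wrapping subtract plus one compare. The sketch below emulates that well-known trick in Python and simply applies it twice. It is not the approach from the question or its answers, and the 32-bit width and all names are assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned wraparound (assumed width)


def in_interval(x, lo, hi):
    """True if lo <= x <= hi for 32-bit unsigned x, with lo <= hi.

    If x < lo, the wrapping subtract yields a huge value that fails the compare.
    """
    return ((x - lo) & MASK32) <= (hi - lo)


def in_either(x, lo1, hi1, lo2, hi2):
    # Two independent range checks; the intervals are assumed not to overlap.
    return in_interval(x, lo1, hi1) or in_interval(x, lo2, hi2)
```

With compile-time constant bounds, `hi - lo` folds to a constant, so each check costs one subtract and one compare in a compiled language. Whether the two checks can be merged further depends on the specific constants.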
