4 votes
0 answers
110 views

"Repacking" 64 to 40 bits in AVX2

I have an array of 64-bit words, which I need to compact by removing the most significant 24 bits of each 64-bit word. Essentially this code in pure C: void repack1(uint8_t* out, uint64_t* in) { ...
swineone
  • 2,960
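
For reference, the scalar task described in this excerpt can be sketched as below. This is not the asker's `repack1` (its body is elided in the excerpt) but a hypothetical `repack40_scalar`, and it assumes the 40 retained bits are written as 5 little-endian bytes per word; the element count `n` is also an added parameter.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar baseline (name, n parameter, and byte order are
   assumptions): drop the top 24 bits of each 64-bit word and store the
   remaining 40 bits as 5 consecutive little-endian bytes. */
static void repack40_scalar(uint8_t *out, const uint64_t *in, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        uint64_t w = in[i];
        for (int b = 0; b < 5; b++) {
            /* Byte b of the word goes to position b of the 5-byte group. */
            out[5 * i + b] = (uint8_t)(w >> (8 * b));
        }
    }
}
```

A baseline like this is mainly useful for checking the output of a vectorized version byte for byte.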
3 votes
0 answers
63 views

CUDA: Load misaligned float4 vector

I want to load 4 floats per thread. I know they are not 16 byte aligned. What is the best way to do this? Specific conditions: I cannot align the array without replicating the data because other ...
Homer512
  • 14.9k
0 votes
1 answer
135 views

Why is `JArray.ToObject<List<T>>` faster than `JArray.ToObject<T[]>`

We are trying to micro-optimise some parts of our production code and I was expecting that using Arrays would generally be better than Lists in .NET, and most of the times, my benchmarks indicate that....
FluidMechanics Potential Flows
29 votes
1 answer
4k views

Why do C compilers still prefer push over mov for saving registers, even when mov appears faster in llvm-mca?

I noticed that modern C compilers typically use push instructions to save caller-saved registers, rather than explicit mov + sub sequences. However, based on llvm-mca simulations, the mov approach ...
Moi5t
  • 465
14 votes
3 answers
988 views

Efficient extraction of first/only key in a dictionary

Assumption is that we have a dictionary containing exactly one key/value pair. Objective is to extract the only key. I can think of four ways to do this (there may be more). import timeit def func1(d):...
Ramrab
  • 28.7k
1 vote
2 answers
127 views

68000 assembly – two-pass routine to find the average of A and count elements above it

Background For practice I wrote a 68000 sub-routine that computes the integer average of an n-element signed-word array A, and counts how many elements of A are strictly greater than that average. ...
Pato
  • 1
2 votes
1 answer
122 views

68000 Assembly – one-pass swap-and-sum of two word vectors (can it be done better?)

Task I’m practising fixed-size loops on a Motorola 68000. Given two 10-word signed arrays A and B, I need to swap each pair (A[i] ↔ B[i]) in place, and build a third array C where C[i] = A[i] + B[i] (...
Pato
  • 1
0 votes
1 answer
154 views

68000 Assembly – Is branchless code faster for counting signed compare conditions?

Context I have a fixed dataset of 10 signed words in A and B, plus two signed thresholds a and b. In one pass I need to store C[i] = A[i] + B[i] accumulate CONT1 += (A[i] > b) accumulate CONT2 += (...
Pato
  • 1
2 votes
2 answers
94 views

68000 Assembly – Reverse Array A into B via Stack Parameters

Task Develop a Motorola 68000 program that Reverses the order of elements in an input array A Copies the result into a second array B All logic must live in one subroutine (or more) with parameters ...
Pato
  • 1
0 votes
0 answers
86 views

AVX2 cross-lane shuffles

I'm writing some AVX2 code that is very permutation-heavy. The main permutation instructions used are unpacks, VPSHUFB, some uses of VPERM2I128 and a few of VPBLENDW. After puzzling over the ...
swineone
  • 2,960
4 votes
1 answer
138 views

68000 Assembly – Build a String from Characters *not* Present in Another & Return Its Length (stack-passed params)

Task Develop a program for the Motorola 68000 that Creates a string C containing every character from A that does not appear anywhere in B. Computes the length of this new string C. All logic must ...
Pato
  • 1
2 votes
2 answers
144 views

Fast combined element-wise minimum/maximum for 64-bit signed integers in AVX2

I need to compute the element-wise minima and maxima of two arrays of 64-bit signed integers, using AVX2. Target is Golden Cove/Raptor Cove (Intel 12th/13th generation P-core). AVX2 has minimum and ...
swineone
  • 2,960
17 votes
7 answers
684 views

Efficient AVX2 implementation of a 17x17-bit squaring operation with result truncation

I am trying to create an efficient AVX2 implementation for a 17x17-bit truncated squarer that returns the 15 most significant bits of the result. This operation appears as a building block in ...
njuffa
  • 26.8k
9 votes
3 answers
324 views

3D Morton code computation utilizing carry-less multiplication

This question arose specifically in the context of exploratory work for RISC-V platforms which may optionally support a carry-less multiplication instruction CLMUL. It would equally apply to other ...
njuffa
  • 26.8k
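
As a point of comparison for the question above, here is a minimal sketch of the conventional shift-and-mask way to build a 3D Morton code. It does not use carry-less multiplication and is not taken from the question; the names `spread3` and `morton3d` and the 10-bit coordinate width are assumptions made for this sketch.

```c
#include <stdint.h>

/* Spread the low 10 bits of x so that bit i moves to bit 3*i
   (two zero bits are inserted between consecutive input bits). */
static uint32_t spread3(uint32_t x)
{
    x &= 0x000003FFu;                  /* keep 10 bits */
    x = (x | (x << 16)) & 0x030000FFu;
    x = (x | (x << 8))  & 0x0300F00Fu;
    x = (x | (x << 4))  & 0x030C30C3u;
    x = (x | (x << 2))  & 0x09249249u;
    return x;
}

/* Interleave three 10-bit coordinates into one 30-bit Morton code,
   with x in the lowest bit of each 3-bit group. */
static uint32_t morton3d(uint32_t x, uint32_t y, uint32_t z)
{
    return spread3(x) | (spread3(y) << 1) | (spread3(z) << 2);
}
```

For example, `morton3d(1, 1, 1)` yields 7, since each coordinate contributes one of the three lowest bits.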
5 votes
3 answers
329 views

efficient check whether unsigned integer value belongs to either of two compile-time constant intervals

In various contexts I have faced the issue of determining whether a given unsigned integer value belongs to either one of two non-overlapping intervals, and not infrequently these checks introduce ...
njuffa
  • 26.8k
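
One common starting point for the kind of check this excerpt describes is the unsigned-subtraction range test, applied once per interval. The sketch below is only a baseline under stated assumptions: the bounds `LO1`, `HI1`, `LO2`, `HI2` are made-up inclusive constants, the helper names are hypothetical, and nothing here is taken from the question's own code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative compile-time bounds (assumed for this sketch only):
   two inclusive, non-overlapping intervals [10, 20] and [100, 150]. */
#define LO1 10u
#define HI1 20u
#define LO2 100u
#define HI2 150u

/* x is in [lo, hi] exactly when (x - lo), computed with unsigned
   wraparound, does not exceed (hi - lo). Values below lo wrap to a
   large number and fail the comparison. */
static bool in_interval(uint32_t x, uint32_t lo, uint32_t hi)
{
    return (uint32_t)(x - lo) <= (uint32_t)(hi - lo);
}

/* Bitwise OR instead of || so both tests can be evaluated without a
   short-circuit branch; whether that helps is up to the compiler. */
static bool in_either(uint32_t x)
{
    return in_interval(x, LO1, HI1) | in_interval(x, LO2, HI2);
}
```

Each interval costs one subtraction and one comparison this way, instead of two comparisons.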
