944 questions
4
votes
0
answers
110
views
"Repacking" 64 to 40 bits in AVX2
I have an array of 64-bit words, which I need to compact by removing the most significant 24 bits of each 64-bit word. Essentially this code in pure C:
void repack1(uint8_t* out, uint64_t* in) {
...
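For reference, the transformation described above (drop the most significant 24 bits of each 64-bit word and store the remaining 40 bits back to back) can be modeled in a few lines of Python. This is only a scalar sketch for checking a vectorized version against, not the AVX2 routine being asked for. The name `repack_reference` and the little-endian output byte order are assumptions, since the C code in the excerpt is truncated.

```python
MASK40 = (1 << 40) - 1  # keeps the low 40 bits of a word


def repack_reference(words):
    """Pack the low 40 bits of each 64-bit word into 5 bytes.

    Little-endian byte order is an assumption, not taken from the question.
    """
    out = bytearray()
    for w in words:
        out += (w & MASK40).to_bytes(5, "little")
    return bytes(out)
```

Eight input words (64 bytes) produce 40 output bytes under this model.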
3
votes
0
answers
63
views
CUDA: Load misaligned float4 vector
I want to load 4 floats per thread. I know they are not 16 byte aligned. What is the best way to do this?
Specific conditions:
I cannot align the array without replicating the data because other ...
0
votes
1
answer
135
views
Why is `JArray.ToObject<List<T>>` faster than `JArray.ToObject<T[]>`
We are trying to micro-optimise some parts of our production code, and I was expecting that using Arrays would generally be better than Lists in .NET; most of the time, my benchmarks indicate that....
29
votes
1
answer
4k
views
Why do C compilers still prefer push over mov for saving registers, even when mov appears faster in llvm-mca?
I noticed that modern C compilers typically use push instructions to save caller-saved registers, rather than explicit mov + sub sequences. However, based on llvm-mca simulations, the mov approach ...
14
votes
3
answers
988
views
Efficient extraction of first/only key in a dictionary
Assume we have a dictionary containing exactly one key/value pair. The objective is to extract the only key.
I can think of four ways to do this (there may be more).
import timeit
def func1(d):...
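A few commonly used ways to pull the single key out of a one-item dictionary are sketched below. The function names are hypothetical, and these are not necessarily the four approaches from the truncated question.

```python
def only_key_iter(d):
    # Take the first key from the dictionary's key iterator.
    return next(iter(d))


def only_key_unpack(d):
    # Tuple unpacking; raises ValueError unless d has exactly one key.
    (key,) = d
    return key


def only_key_list(d):
    # Materialize all keys into a list, then index it.
    return list(d)[0]
```

The unpacking form also verifies the "exactly one key" assumption, which the other two silently ignore.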
1
vote
2
answers
127
views
68000 assembly – two-pass routine to find the average of A and count elements above it
Background
For practice I wrote a 68000 sub-routine that
computes the integer average of an n-element signed-word array A, and
counts how many elements of A are strictly greater than that average.
...
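The two passes described above can be modeled in Python as a reference for checking the assembly output. This is a sketch under assumptions: the name `average_and_count_above` is hypothetical, and the average is truncated toward zero, which is how the 68000 `DIVS` instruction is generally understood to round. The question's own rounding rule is not visible in the excerpt.

```python
def average_and_count_above(a):
    """Return (integer average, number of elements strictly above it)."""
    # Pass 1: sum the elements and divide, truncating toward zero (assumed).
    total = sum(a)
    avg = abs(total) // len(a)
    if total < 0:
        avg = -avg
    # Pass 2: count elements strictly greater than the average.
    count = sum(1 for x in a if x > avg)
    return avg, count
```

For example, `[-1, -2]` sums to -3; truncating division by 2 gives -1, so no element is strictly above the average.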
2
votes
1
answer
122
views
68000 Assembly – one-pass swap-and-sum of two word vectors (can it be done better?)
Task
I'm practising fixed-size loops on a Motorola 68000.
Given two 10-word signed arrays A and B, I need to
swap each pair (A[i] ↔ B[i]) in place, and
build a third array C where C[i] = A[i] + B[i] (...
0
votes
1
answer
154
views
68000 Assembly – Is branchless code faster for counting signed compare conditions?
Context
I have a fixed dataset of 10 signed words in A and B, plus two signed thresholds a and b.
In one pass I need to
store C[i] = A[i] + B[i]
accumulate CONT1 += (A[i] > b)
accumulate CONT2 += (...
2
votes
2
answers
94
views
68000 Assembly – Reverse Array A into B via Stack Parameters
Task
Develop a Motorola 68000 program that
Reverses the order of elements in an input array A
Copies the result into a second array B
All logic must live in one subroutine (or more) with parameters ...
0
votes
0
answers
86
views
AVX2 cross-lane shuffles
I'm writing some AVX2 code that is very permutation-heavy. The main permutation instructions used are unpacks, VPSHUFB, some uses of VPERM2I128 and a few of VPBLENDW.
After puzzling over the ...
4
votes
1
answer
138
views
68000 Assembly – Build a String from Characters *not* Present in Another & Return Its Length (stack-passed params)
Task
Develop a program for the Motorola 68000 that
Creates a string C containing every character from A that does not appear anywhere in B.
Computes the length of this new string C.
All logic must ...
2
votes
2
answers
144
views
Fast combined element-wise minimum/maximum for 64-bit signed integers in AVX2
I need to compute the element-wise minima and maxima of two arrays of 64-bit signed integers, using AVX2. Target is Golden Cove/Raptor Cove (Intel 12th/13th generation P-core).
AVX2 has minimum and ...
17
votes
7
answers
684
views
Efficient AVX2 implementation of a 17x17-bit squaring operation with result truncation
I am trying to create an efficient AVX2 implementation for a 17x17-bit truncated squarer that returns the 15 most significant bits of the result. This operation appears as a building block in ...
9
votes
3
answers
324
views
3D Morton code computation utilizing carry-less multiplication
This question arose specifically in the context of exploratory work for RISC-V platforms which may optionally support a carry-less multiplication instruction CLMUL. It would equally apply to other ...
5
votes
3
answers
329
views
Efficient check whether unsigned integer value belongs to either of two compile-time constant intervals
In various contexts I have faced the issue of determining whether a given unsigned integer value belongs to either one of two non-overlapping intervals, and not infrequently these checks introduce ...
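As a baseline for this kind of check, the well-known subtract-and-compare trick tests one interval with a single unsigned comparison. The sketch below models it in Python with explicit 32-bit wraparound. The function names, the 32-bit width, and the closed-interval convention are assumptions, and this is not necessarily the approach the question arrives at.

```python
MASK32 = 0xFFFFFFFF  # emulate 32-bit unsigned wraparound (assumed width)


def in_interval(x, lo, hi):
    """True iff lo <= x <= hi, using one unsigned comparison.

    If x < lo, the subtraction wraps to a large value and the test fails.
    """
    return ((x - lo) & MASK32) <= (hi - lo)


def in_either(x, lo1, hi1, lo2, hi2):
    """True iff x lies in [lo1, hi1] or [lo2, hi2]."""
    return in_interval(x, lo1, hi1) or in_interval(x, lo2, hi2)
```

With compile-time constant bounds, `hi - lo` folds to a constant, so each interval costs one subtraction and one comparison.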