So I want to store all these constants in one place.
But there is a problem: I check which CPU extensions are available at run time.
If the CPU doesn't support, for example, SSE (or AVX), the program will crash during initialization of the constants.
So is it possible to initialize these constants without using SSE?
If the M128 union is defined locally where you use the loop, this should have no performance overhead (it will be loaded from memory once at the beginning of the loop). Because it contains a member of type __m128i, M128 inherits the correct alignment.
M128 k8 = ...;
// use k8.i128 in your for loop
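The answer does not show the definition of M128, so the following is only a minimal sketch of what such a union might look like. The member layout, the value 8, and the function name add8_sse2 are assumptions made for illustration. The integer array is placed first so that a brace initializer fills it with plain data; in C a static initializer must be a constant expression, so no SSE instruction is needed to set it up.

```c
#include <emmintrin.h>  /* SSE2 intrinsics: __m128i, _mm_add_epi32, ... */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical layout: the integer array comes first so that a brace
   initializer targets it; the __m128i member gives 16-byte alignment. */
typedef union {
    int32_t i32[4];
    __m128i i128;
} M128;

/* Constant-initialized data: nothing executes at startup to build it. */
static const M128 k8 = { { 8, 8, 8, 8 } };

/* Adds 8 to every element. Call this only after the run-time check has
   confirmed SSE2 support. n is assumed to be a multiple of 4. */
void add8_sse2(int32_t *data, size_t n)
{
    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        v = _mm_add_epi32(v, k8.i128);
        _mm_storeu_si128((__m128i *)(data + i), v);
    }
}
```

Reading i128 after initializing i32 relies on union type punning, which C permits; in C++ it is formally undefined, although the major compilers accept it.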
If it is defined somewhere else, then you need to copy it into a local variable before you start the loop, otherwise the compiler may not be able to optimize it.
__m128i tmp = k8.i128;
// for loop here
This will load k8 into a CPU register and keep it there for the duration of the loop, as long as there are enough free registers to carry out the loop body.
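A rough sketch of that pattern follows, assuming the same kind of hypothetical M128 union and a constant with external linkage standing in for one defined in another file. The function name add_k8 and the value 8 are illustrative only.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* Hypothetical union, as assumed for this sketch. */
typedef union {
    int32_t i32[4];
    __m128i i128;
} M128;

/* Stand-in for a constant that would normally live in another
   translation unit and be declared extern here. */
const M128 k8 = { { 8, 8, 8, 8 } };

/* n is assumed to be a multiple of 4. */
void add_k8(int32_t *data, size_t n)
{
    /* Copy once, before the loop; the compiler can then keep the
       value in an XMM register if enough registers are free. */
    const __m128i tmp = k8.i128;

    for (size_t i = 0; i < n; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(data + i));
        _mm_storeu_si128((__m128i *)(data + i), _mm_add_epi32(v, tmp));
    }
}
```

Whether the value actually stays in a register is the compiler's decision, so it is worth checking the generated assembly for the loop body.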
Depending on which compiler you use, these unions may already be defined (Visual Studio does define them), but the compiler-provided definitions may not be portable.
You usually don't need this. Compilers are very good at using the same storage for multiple functions that use the same constant. Just like merging multiple instances of the same string literal into one string constant, multiple instances of the same _mm_set* in different functions will all load from the same vector constant (or generate on the fly for _mm_setzero_si128() or _mm_set1_epi8(-1)).
Using Godbolt's binary output (disassembly) mode lets you see whether different functions are loading from the same block of memory or not. Look at the comment it adds, which resolves the RIP-relative addresses to absolute addresses.
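A small test case for trying this out might look like the sketch below. The function names add_bias and sub_bias and the constant values are made up for illustration. Both functions spell out the same _mm_set_epi32 constant independently; with optimization enabled, the disassembly would typically show both loading from the same address, though that is a compiler choice and not a guarantee.

```c
#include <emmintrin.h>  /* SSE2 intrinsics */

/* Two unrelated functions that happen to use the same vector constant.
   The constant is written out separately in each one. */
__m128i add_bias(__m128i v)
{
    return _mm_add_epi32(v, _mm_set_epi32(40, 30, 20, 10));
}

__m128i sub_bias(__m128i v)
{
    return _mm_sub_epi32(v, _mm_set_epi32(40, 30, 20, 10));
}
```

Because the intrinsic sits inside each function, nothing runs at program startup, so no SSE instruction executes unless one of these functions is actually called.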
clang: identical constants share storage. 16B and 32B constants don't overlap, even when one is a subset of the other. Some functions using repetitive constants use an AVX2 vpbroadcastd broadcast-load (which doesn't even take an ALU uop on Intel SnB-family CPUs). For some reason, it chooses to do this based on the element size of the operation, not the repetitivity of the constant. Note that clang's asm output repeats the constant for each use, but the final binary doesn't.
MSVC: identical constants share storage. Pretty much the same as what gcc does. (The full asm output is hard to wade through; use search. I could only get the asm at all by having main find the path to the .exe, then work out t
Asked in February 2016 · Viewed 3,447 times · Voted 14 · Answered 4 times