Akira February 2016

How can I set an __m128i without using any SSE instructions?

I have many functions that use the same constant __m128i values. For example:

const __m128i K8 = _mm_setr_epi8(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16);
const __m128i K16 = _mm_setr_epi16(1, 2, 3, 4, 5, 6, 7, 8);
const __m128i K32 = _mm_setr_epi32(1, 2, 3, 4);

So I want to store all these constants in one place. But there is a problem: I check which CPU extensions are available at run time. If the CPU doesn't support, for example, SSE (or AVX), the program will crash while the constants are being initialized.

So is it possible to initialize these constants without using SSE?

Answers


Paul R February 2016

I suggest defining the initialisation data globally as scalar data and then loading it locally into a const __m128i:

static const uint8_t gK8[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

static inline void foo()
{
    // Unaligned load: gK8 is a plain byte array with no 16-byte alignment guarantee.
    const __m128i K8 = _mm_loadu_si128((const __m128i *)gK8);

    // ...
}
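
A minimal sketch of this approach, assuming SSE2 is enabled for the translation unit (the default on x86-64). The names AddK8 and CheckAddK8 are hypothetical and used only for illustration. The table is plain scalar data, so initializing it executes no SSE instruction; the vector load happens only inside the SSE2 code path, which should be called only after the run-time CPU check has passed.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <emmintrin.h> // SSE2 intrinsics

// Plain scalar table: statically initialized, so no SSE instruction runs at start-up.
static const uint8_t gK8[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

// Hypothetical SSE2 code path: adds K8 to 16 bytes of src and writes them to dst.
// Call it only after the run-time CPU check has confirmed SSE2 support.
static void AddK8(const uint8_t* src, uint8_t* dst)
{
    const __m128i K8 = _mm_loadu_si128(reinterpret_cast<const __m128i*>(gK8));
    const __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst), _mm_add_epi8(v, K8));
}

// Hypothetical self-check: adding K8 to sixteen zero bytes must reproduce the table.
static void CheckAddK8()
{
    const uint8_t zeros[16] = {};
    uint8_t out[16];
    AddK8(zeros, out);
    assert(std::memcmp(out, gK8, sizeof(out)) == 0);
}
```

The load is unaligned on purpose, because a uint8_t array carries no 16-byte alignment guarantee.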


ErmIg February 2016

Initializing an __m128i vector without using SSE instructions is possible, but how to do it depends on how the compiler defines __m128i.

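For Microsoft Visual Studio you can define the following macros (it defines __m128i as a union whose first member is a 16-element array of 8-bit integers, so it can be brace-initialized like a char[16]):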

template <class T> inline char GetChar(T value, size_t index)
{
    return ((char*)&value)[index];
}

#define AS_CHAR(a) char(a)

#define AS_2CHARS(a) \
    GetChar(int16_t(a), 0), GetChar(int16_t(a), 1)

#define AS_4CHARS(a) \
    GetChar(int32_t(a), 0), GetChar(int32_t(a), 1), \
    GetChar(int32_t(a), 2), GetChar(int32_t(a), 3)

#define _MM_SETR_EPI8(a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, aa, ab, ac, ad, ae, af) \
    {AS_CHAR(a0), AS_CHAR(a1), AS_CHAR(a2), AS_CHAR(a3), \
     AS_CHAR(a4), AS_CHAR(a5), AS_CHAR(a6), AS_CHAR(a7), \
     AS_CHAR(a8), AS_CHAR(a9), AS_CHAR(aa), AS_CHAR(ab), \
     AS_CHAR(ac), AS_CHAR(ad), AS_CHAR(ae), AS_CHAR(af)}

#define _MM_SETR_EPI16(a0, a1, a2, a3, a4, a5, a6, a7) \
    {AS_2CHARS(a0), AS_2CHARS(a1), AS_2CHARS(a2), AS_2CHARS(a3), \
     AS_2CHARS(a4), AS_2CHARS(a5), AS_2CHARS(a6), AS_2CHARS(a7)}

#define _MM_SETR_EPI32(a0, a1, a2, a3) \
    {AS_4CHARS(a0), AS_4CHARS(a1), AS_4CHARS(a2), AS_4CHARS(a3)}       

For GCC it will be (it defines __m128i as a vector of two long long values):

#define CHAR_AS_LONGLONG(a) (((long long)(a)) & 0xFF)

#define SHORT_AS_LONGLONG(a) (((long long)(a)) & 0xFFFF)

#define INT_AS_LONGLONG(a) (((long long)(a)) & 0xFFFFFFFF)

#define LL_SETR_EPI8(a, b, c, d, e, f, g, h) \
    (CHAR_AS_LONGLONG(a) | (CHAR_AS_LONGLONG(b) << 8) | \
    (CHAR_AS_LONGLONG(c) << 16) | (CHAR_AS_LONGLONG(d) << 24) | \
    (CHAR_AS_LONGLONG(e) << 32) | (CHAR_AS_LONGLONG(f) << 40) | \
    (CHAR_AS_LONGLONG(g) << 48) | (CHAR_AS_LONGLONG(h) << 56))

#define LL_SETR_EPI16(a, b, c, d) \
    (SHORT_AS_LONGLONG(a) | (SHORT_AS_LONGLONG(b) << 16) | \
    (SHORT_AS_LONGLONG(c) << 32) | (SHORT_AS_LONGLONG(d) << 48))
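
The packing idea behind the GCC macros above can be checked without any SSE code. The sketch below is not the answer's macro set: it swaps in constexpr functions and unsigned 64-bit arithmetic (which avoids shifting a value into the sign bit of a long long), and the names Byte, PackEpi8, gK8Halves and CheckPackedLayout are hypothetical. It assumes a little-endian target, which holds for x86.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical helper: keeps the low 8 bits of v and moves them to byte position `lane`.
constexpr uint64_t Byte(long long v, int lane)
{
    return (static_cast<uint64_t>(v) & 0xFF) << (8 * lane);
}

// Hypothetical helper: packs eight byte lanes into one 64-bit half, lane 0 in the lowest byte.
constexpr uint64_t PackEpi8(long long a, long long b, long long c, long long d,
                            long long e, long long f, long long g, long long h)
{
    return Byte(a, 0) | Byte(b, 1) | Byte(c, 2) | Byte(d, 3) |
           Byte(e, 4) | Byte(f, 5) | Byte(g, 6) | Byte(h, 7);
}

// Two 64-bit halves holding the 16 bytes of one 128-bit constant.
static const uint64_t gK8Halves[2] = {
    PackEpi8(1, 2, 3, 4, 5, 6, 7, 8),
    PackEpi8(9, 10, 11, 12, 13, 14, 15, 16)
};

// Self-check (little-endian only): the bytes in memory must read 1..16 in order,
// which is the lane order _mm_setr_epi8(1, ..., 16) would produce.
static void CheckPackedLayout()
{
    unsigned char bytes[16];
    std::memcpy(bytes, gK8Halves, sizeof(bytes));
    for (int i = 0; i < 16; ++i)
        assert(bytes[i] == i + 1);
}
```

Because the packing is a compile-time integer computation, the resulting halves are ordinary static data and need no vector instruction to initialize.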


Fabio February 2016

You can use a union.

union M128 {
   char i8[16];
   __m128i i128;
};

const M128 k8 = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

If the M128 union is defined locally where you use the loop, this should have no performance overhead (it will be loaded from memory once, at the beginning of the loop). Because it contains a member of type __m128i, M128 inherits the correct alignment.

void foo()
{
   M128 k8 = ...;
   // use k8.i128 in your for loop
}

If it is defined somewhere else, then you need to copy it into a local variable before you start the loop; otherwise the compiler may not be able to optimize it.

void foo()
{
    __m128i tmp = k8.i128;
    // for loop here
}

This will load k8 into a CPU register and keep it there for the duration of the loop, as long as there are enough free registers to carry out the loop body.

Depending on which compiler you use, such unions may already be defined (VS defines them), but the compiler-provided definitions may not be portable.
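
A minimal sketch of the union approach, assuming SSE2 is enabled for the translation unit. The names gK8, AddK8ToBlocks and CheckAddK8ToBlocks are hypothetical. Note the assumption it rests on: reading i128 after initializing i8 is union type punning, which ISO C++ does not guarantee; GCC documents it as supported, and other compilers may behave differently.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <emmintrin.h> // SSE2 intrinsics

union M128
{
    char i8[16];
    __m128i i128;
};

// The union takes its size and alignment from its __m128i member.
static_assert(sizeof(M128) == 16, "M128 should be exactly one vector wide");
static_assert(alignof(M128) == alignof(__m128i), "M128 should be aligned like __m128i");

// Brace initialization fills the first member (i8), so this is plain static data.
static const M128 gK8 = { { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 } };

// Hypothetical SSE2 code path: adds K8 to each 16-byte block of data.
static void AddK8ToBlocks(uint8_t* data, std::size_t blocks)
{
    // Copy the constant into a local before the loop (relies on union type punning).
    const __m128i k8 = gK8.i128;
    for (std::size_t i = 0; i < blocks; ++i)
    {
        __m128i* p = reinterpret_cast<__m128i*>(data + 16 * i);
        _mm_storeu_si128(p, _mm_add_epi8(_mm_loadu_si128(p), k8));
    }
}

// Hypothetical self-check: two zeroed blocks must each become 1..16.
static void CheckAddK8ToBlocks()
{
    uint8_t buf[32] = {};
    AddK8ToBlocks(buf, 2);
    for (int i = 0; i < 32; ++i)
        assert(buf[i] == i % 16 + 1);
}
```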


Peter Cordes February 2016

You usually don't need this. Compilers are very good at using the same storage for multiple functions that use the same constant. Just like merging multiple instances of the same string literal into one string constant, multiple instances of the same _mm_set* in different functions will all load from the same vector constant (or generate on the fly for _mm_setzero_si128() or _mm_set1_epi8(-1)).
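
A small test case for checking this yourself, assuming SSE2 is enabled for the translation unit. The function names are hypothetical. Both functions write the same _mm_setr_epi32 constant inline; since the constant is only materialized when one of the functions runs, nothing here executes at program start-up. Whether the two uses end up sharing one block of memory is up to the compiler, so inspect the disassembly rather than taking it for granted.

```cpp
#include <cassert>
#include <emmintrin.h> // SSE2 intrinsics

// Two hypothetical functions that each spell out the same vector constant inline.
__m128i AddRamp(__m128i v)
{
    return _mm_add_epi32(v, _mm_setr_epi32(1, 2, 3, 4));
}

__m128i SubRamp(__m128i v)
{
    return _mm_sub_epi32(v, _mm_setr_epi32(1, 2, 3, 4));
}

// Hypothetical self-check on the lowest 32-bit lane of each result.
static void CheckRamp()
{
    assert(_mm_cvtsi128_si32(AddRamp(_mm_setzero_si128())) == 1); // 0 + 1
    assert(_mm_cvtsi128_si32(SubRamp(_mm_set1_epi32(5))) == 4);   // 5 - 1
}
```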

Using Godbolt's binary output (disassembly) mode lets you see whether different functions are loading from the same block of memory or not. Look at the comment it adds, which resolves the RIP-relative addresses to absolute addresses.

  • gcc: all identical constants share the same storage, regardless of whether they're from auto-vectorization or _mm_set. 32B constants can't overlap with 16B constants, even if the 16B constant is a subset of the 32B.

  • clang: identical constants share storage. 16B and 32B constants don't overlap, even when one is a subset of the other. Some functions using repetitive constants use an AVX2 vpbroadcastd broadcast-load (which doesn't even take an ALU uop on Intel SnB-family CPUs). For some reason, it chooses to do this based on the element size of the operation, not the repetitivity of the constant. Note that clang's asm output repeats the constant for each use, but the final binary doesn't.

  • MSVC: identical constants share storage. Pretty much the same as what gcc does. (The full asm output is hard to wade through; use search. I could only get the asm at all by having main find the path to the .exe, then work out t

Post Status

Asked in February 2016
Viewed 3,447 times
Voted 14
Answered 4 times
