How to make GCC generate a bswap instruction for a big-endian store without builtins?

Question

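I'm working on a function that stores a 64-bit value into memory in big-endian format. I was hoping I could write portable C99 code that works on both little-endian and big-endian platforms, and have modern x86 compilers generate a bswap instruction automatically, without any builtins or intrinsics. So I started with the following function: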

#include <stdint.h>

void
encode_bigend_u64(uint64_t value, void *vdest) {
    uint64_t bigend;
    uint8_t *bytes = (uint8_t*)&bigend;
    bytes[0] = value >> 56;
    bytes[1] = value >> 48;
    bytes[2] = value >> 40;
    bytes[3] = value >> 32;
    bytes[4] = value >> 24;
    bytes[5] = value >> 16;
    bytes[6] = value >> 8;
    bytes[7] = value;
    uint64_t *dest = (uint64_t*)vdest;
    *dest = bigend;
}

This works fine with clang, which compiles this function to:

bswapq  %rdi
movq    %rdi, (%rsi)
retq

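But GCC fails to detect the byte swap. I tried a couple of different approaches, but they only made things worse. I know that GCC can detect byte swaps written with bitwise-and, shift, and bitwise-or, so why doesn't it work when writing individual bytes?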

Edit: I found the corresponding GCC bug.

Answer 1

This seems to do the trick:

#include <stdint.h>
#include <string.h>  // memcpy

void encode_bigend_u64(uint64_t value, void* dest)
{
  // Mask-and-shift byte reversal: each byte is isolated and moved to its mirrored position.
  value =
      ((value & 0xFF00000000000000u) >> 56u) |
      ((value & 0x00FF000000000000u) >> 40u) |
      ((value & 0x0000FF0000000000u) >> 24u) |
      ((value & 0x000000FF00000000u) >>  8u) |
      ((value & 0x00000000FF000000u) <<  8u) |
      ((value & 0x0000000000FF0000u) << 24u) |
      ((value & 0x000000000000FF00u) << 40u) |
      ((value & 0x00000000000000FFu) << 56u);
  // memcpy avoids alignment and aliasing assumptions about dest.
  memcpy(dest, &value, sizeof(uint64_t));
}

clang with `-O3`

encode_bigend_u64(unsigned long, void*):
        bswapq  %rdi
        movq    %rdi, (%rsi)
        retq

clang with `-O3 -march=native`

encode_bigend_u64(unsigned long, void*):
        movbeq  %rdi, (%rsi)
        retq

gcc with `-O3`

encode_bigend_u64(unsigned long, void*):
        bswap   %rdi
        movq    %rdi, (%rsi)
        ret

gcc with `-O3 -march=native`

encode_bigend_u64(unsigned long, void*):
        movbe   %rdi, (%rsi)
        ret

Tested on http://gcc.godbolt.org/ with clang 3.8.0 and gcc 5.3.0 (so I don't know exactly which processor is underneath for -march=native, but I strongly suspect a recent x86_64 processor).

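If you want a function that also works on big-endian architectures, you can use the answers from here to detect the endianness of the system and add an if. Both the union and the pointer-cast versions work and are optimized by gcc and clang, resulting in exactly the same assembly (no branches). Full code on godbolt: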

int is_big_endian(void)
{
    union {
        uint32_t i;
        char c[4];
    } bint = {0x01020304};

    return bint.c[0] == 1;
}

void encode_bigend_u64_union(uint64_t value, void* dest)
{
  if (!is_big_endian())
    //...
  memcpy(dest, &value, sizeof(uint64_t));
}
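As a minimal sketch of how the elided branch might be filled in, assuming the mask-and-shift swap from the first function is what goes there. The names host_is_big_endian and encode_bigend_u64_portable are made up for this illustration and are not taken from the linked Godbolt code:

```c
#include <stdint.h>
#include <string.h>

/* Runtime endianness probe, same idea as is_big_endian() above. */
static int host_is_big_endian(void)
{
    union {
        uint32_t i;
        unsigned char c[4];
    } probe = {0x01020304};
    return probe.c[0] == 1;
}

/* Hypothetical name: stores value at dest in big-endian byte order. */
void encode_bigend_u64_portable(uint64_t value, void *dest)
{
    if (!host_is_big_endian()) {
        /* Little-endian host: reverse the bytes before storing. */
        value =
            ((value & 0xFF00000000000000ULL) >> 56) |
            ((value & 0x00FF000000000000ULL) >> 40) |
            ((value & 0x0000FF0000000000ULL) >> 24) |
            ((value & 0x000000FF00000000ULL) >>  8) |
            ((value & 0x00000000FF000000ULL) <<  8) |
            ((value & 0x0000000000FF0000ULL) << 24) |
            ((value & 0x000000000000FF00ULL) << 40) |
            ((value & 0x00000000000000FFULL) << 56);
    }
    /* Executed on every host; only the swap is conditional. */
    memcpy(dest, &value, sizeof value);
}
```

The braces matter here: only the swap should be conditional, while the memcpy has to run on both kinds of host.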

Intel® 64 and IA-32 Architectures Instruction Set Reference（3-542 Vol.2A）：

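MOVBE: Move Data After Swapping Bytes

Performs a byte swap operation on the data copied from the second operand (source operand) and stores the result in the first operand (destination operand). [...]

The MOVBE instruction is provided for swapping the bytes on a read from memory or on a write to memory, thus providing support for converting little-endian values to big-endian format and vice versa.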

Answer 2

All functions in this answer, with asm output, are on the Godbolt Compiler Explorer.

GNU C has a uint64_t __builtin_bswap64 (uint64_t x), since GNU C 4.3. This is apparently the most reliable way to get gcc / clang to generate good code for this.
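A minimal sketch of a big-endian store built on that builtin. It assumes a GNU C compatible compiler (gcc or clang) that predefines __BYTE_ORDER__ and __ORDER_LITTLE_ENDIAN__, and the function name store_be64_builtin is invented for illustration:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: writes value to dst in big-endian byte order. */
void store_be64_builtin(void *dst, uint64_t value)
{
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
    /* Little-endian host: reverse the bytes first. */
    value = __builtin_bswap64(value);
#endif
    /* memcpy makes no alignment assumption about dst. */
    memcpy(dst, &value, sizeof value);
}
```

Because the endianness check happens in the preprocessor, a big-endian build contains no swap at all.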

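glibc provides htobe64, htole64, and similar host to/from BE and LE functions that swap or not, depending on the endianness of the machine. See the docs for <endian.h>. The man page says they were added to glibc in version 2.9 (released 2008-11).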

#define _BSD_SOURCE             /* See feature_test_macros(7) */

#include <stdint.h>

#include <endian.h>
// ideal code with clang from 3.0 onwards, probably earlier
// ideal code with gcc from 4.4.7 onwards, probably earlier
uint64_t load_be64_endian_h(const uint64_t *be_src) { return be64toh(*be_src); }
    movq    (%rdi), %rax
    bswap   %rax

void store_be64_endian_h(uint64_t *be_dst, uint64_t data) { *be_dst = htobe64(data); }
    bswap   %rsi
    movq    %rsi, (%rdi)

// check that the compiler understands the data movement and optimizes away a double-conversion (which inline-asm `bswap` wouldn't)
// it does optimize away with gcc 4.9.3 and later, but not with gcc 4.9.0 (2x bswap)
// optimizes away with clang 3.7.0 and later, but not clang 3.6 or earlier (2x bswap)
uint64_t double_convert(uint64_t data) {
  uint64_t tmp;
  store_be64_endian_h(&tmp, data);
  return load_be64_endian_h(&tmp);
}
    movq    %rdi, %rax

You reliably get good code from these functions even at -O1, and they use movbe when -march is set to a CPU that supports that insn.

If you're targeting GNU C but not glibc, you can borrow the definition from glibc (remember that it's LGPLed code, though):

#ifdef __GNUC__
# if __GNUC_PREREQ (4, 3)

static __inline unsigned int
__bswap_32 (unsigned int __bsx) { return __builtin_bswap32 (__bsx);  }

# elif __GNUC__ >= 2
    // ... some fallback stuff you only need if you're using an ancient gcc version, using inline asm for non-compile-time-constant args
# endif  // gcc version
#endif // __GNUC__

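If you really need a fallback that compiles well on compilers that don't support GNU C builtins, the code from @bolov's answer could be used to implement a bswap that compiles nicely. Preprocessor macros could be used to choose whether to swap or not (like glibc does), to implement host-to-BE and host-to-LE functions. The bswap used by glibc when __builtin_bswap or x86 asm isn't available uses the mask-and-shift idiom that bolov found works well. gcc recognizes it better than shifting alone.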
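The macro-based selection described above could be sketched roughly as follows. This is an illustration rather than glibc's actual code: portable_bswap64, host_to_be64, host_to_le64, and HOST_IS_BIG_ENDIAN are hypothetical names, and the byte-order test assumes the compiler predefines __BYTE_ORDER__ (gcc and clang do; other compilers may need HOST_IS_BIG_ENDIAN set by hand).

```c
#include <stdint.h>

/* Detect host byte order at compile time, or refuse to guess. */
#if defined(__BYTE_ORDER__) && defined(__ORDER_BIG_ENDIAN__) && \
    __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
# define HOST_IS_BIG_ENDIAN 1
#elif defined(__BYTE_ORDER__) && defined(__ORDER_LITTLE_ENDIAN__) && \
    __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
# define HOST_IS_BIG_ENDIAN 0
#else
# error "unknown host byte order; define HOST_IS_BIG_ENDIAN manually"
#endif

static inline uint64_t portable_bswap64(uint64_t x)
{
#if defined(__clang__) || \
    (defined(__GNUC__) && (__GNUC__ > 4 || (__GNUC__ == 4 && __GNUC_MINOR__ >= 3)))
    return __builtin_bswap64(x);            /* GNU C builtin when available */
#else
    /* Fallback: the mask-and-shift idiom from bolov's answer. */
    return ((x & 0xFF00000000000000ULL) >> 56) |
           ((x & 0x00FF000000000000ULL) >> 40) |
           ((x & 0x0000FF0000000000ULL) >> 24) |
           ((x & 0x000000FF00000000ULL) >>  8) |
           ((x & 0x00000000FF000000ULL) <<  8) |
           ((x & 0x0000000000FF0000ULL) << 24) |
           ((x & 0x000000000000FF00ULL) << 40) |
           ((x & 0x00000000000000FFULL) << 56);
#endif
}

/* Swap only when the host order differs from the target order. */
static inline uint64_t host_to_be64(uint64_t x)
{
    return HOST_IS_BIG_ENDIAN ? x : portable_bswap64(x);
}

static inline uint64_t host_to_le64(uint64_t x)
{
    return HOST_IS_BIG_ENDIAN ? portable_bswap64(x) : x;
}
```

Since HOST_IS_BIG_ENDIAN is a compile-time constant, the conditional should fold away, leaving either a plain move or a single swap.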

The code from this Endian-agnostic coding blog post compiles to bswap with gcc, but not with clang. IDK if there's anything that both of their pattern recognizers will recognize.

// Note that this is a load, not a store like the code in the question.
uint64_t be64_to_host(unsigned char* data) {
    return
      ((uint64_t)data[7]<<0)  | ((uint64_t)data[6]<<8 ) |
      ((uint64_t)data[5]<<16) | ((uint64_t)data[4]<<24) |
      ((uint64_t)data[3]<<32) | ((uint64_t)data[2]<<40) |
      ((uint64_t)data[1]<<48) | ((uint64_t)data[0]<<56);
}

    ## gcc 5.3 -O3 -march=haswell
    movbe   (%rdi), %rax
    ret

    ## clang 3.8 -O3 -march=haswell
    movzbl  7(%rdi), %eax
    movzbl  6(%rdi), %ecx
    shlq    $8, %rcx
    orq     %rax, %rcx
    ... completely naive implementation

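The htonll from this answer compiles to two 32-bit bswaps combined with shift/or. This kind of sucks, but isn't terrible with either gcc or clang.

I didn't have any luck with a union { uint64_t a; uint8_t b[8]; } version of the OP's code. clang still compiles it to a 64-bit bswap, but I think it compiles to even worse code with gcc. (See the godbolt link.)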
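For readers who want to try this themselves, a union-based variant might look like the sketch below. It is an illustrative reconstruction, not necessarily the exact code behind the godbolt link, and the name encode_bigend_u64_union_bytes is made up:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical name: same byte-by-byte idea as the question, using a union. */
void encode_bigend_u64_union_bytes(uint64_t value, void *vdest)
{
    union { uint64_t a; uint8_t b[8]; } u;

    /* Most significant byte goes to the lowest address. */
    u.b[0] = (uint8_t)(value >> 56);
    u.b[1] = (uint8_t)(value >> 48);
    u.b[2] = (uint8_t)(value >> 40);
    u.b[3] = (uint8_t)(value >> 32);
    u.b[4] = (uint8_t)(value >> 24);
    u.b[5] = (uint8_t)(value >> 16);
    u.b[6] = (uint8_t)(value >> 8);
    u.b[7] = (uint8_t)value;

    /* Copy the assembled 8 bytes out without assuming vdest is aligned. */
    memcpy(vdest, &u.a, sizeof u.a);
}
```

The result is correct on any host byte order; whether a given compiler collapses it into a single bswap is the open question discussed above.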

Answer 3

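I like Peter's solution, but here's something else you can use on Haswell. Haswell has the movbe instruction, which is 3 uops there (no cheaper than bswap r64 plus a normal load or store), but it is faster on Atom / Silvermont (https://agner.org/optimize/):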

// AT&T syntax, compile without -masm=intel
static inline
uint64_t load_bigend_u64(uint64_t value)
{
    __asm__ ("movbe %[src], %[dst]"   // x86-64 only
             :  [dst] "=r" (value)
             :  [src] "m" (value)
            );
    return value;
}

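Use it with something like uint64_t tmp = load_bigend_u64(array[i]);

You could reverse this to make a store_bigend function, or use bswap to modify a value in a register and let the compiler load/store it.

I changed the function to return value because the alignment of vdest was not clear to me.

Usually a feature is guarded by a preprocessor macro. I'd expect __MOVBE__ to be used for the movbe feature flag, but it's not present (this machine has the feature):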
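The register-based alternative mentioned above (bswap on a value in a register, with the compiler doing the store) can be sketched as follows. The names bswap_in_register and store_bigend_u64 are hypothetical, the code assumes a little-endian host, and the non-x86 branch exists only so the sketch still builds with gcc or clang on other targets:

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: byte-swap a value held in a register. */
static inline uint64_t bswap_in_register(uint64_t value)
{
#if defined(__x86_64__)
    /* "+r": value is both input and output, kept in a 64-bit register. */
    __asm__ ("bswap %0" : "+r" (value));
    return value;
#else
    /* Assumption: gcc/clang on a non-x86 target. */
    return __builtin_bswap64(value);
#endif
}

/* Hypothetical name; assumes a little-endian host such as x86-64. */
void store_bigend_u64(void *dest, uint64_t value)
{
    value = bswap_in_register(value);
    /* The compiler picks the store; memcpy tolerates unaligned dest. */
    memcpy(dest, &value, sizeof value);
}
```

Keep in mind that inline asm is opaque to the optimizer, so a store followed by a load will not have its two swaps cancelled the way the builtin-based versions can.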

$ gcc -march=native -dM -E - < /dev/null | sort
...
#define __LWP__ 1
#define __LZCNT__ 1
#define __MMX__ 1
#define __MWAITX__ 1
#define __NO_INLINE__ 1
#define __ORDER_BIG_ENDIAN__ 4321
...
