`__uint128_t`上效率最高的popcount？

Question

我需要以最有效（最快）的方式使用128位大小的无符号变量。

操作系统：Linux / Debian 9
编译：GCC 8
CPU：Intel i7-5775C

虽然如果解决方案更便携，甚至更好。

首先，GCC有两种类型，即__uint128_t和unsigned __int128。我猜它们最终是相同的，并且没有理由写出丑陋的unsigned __int128的东西，所以尽管它应该是新类型，我更喜欢第一个，它更像是标准的uint64_t。此外，英特尔还有__uint128_t这是使用它的另一个原因（便携性）。

我写了以下代码：

#include <nmmintrin.h>
#include <stdint.h>

static inline   uint_fast8_t    popcnt_u128 (__uint128_t n)
{
    const uint64_t      n_hi    = n >> 64;
    const uint64_t      n_lo    = n;
    const uint_fast8_t  cnt_hi  = _mm_popcnt_u64(n_hi);
    const uint_fast8_t  cnt_lo  = _mm_popcnt_u64(n_lo);
    const uint_fast8_t  cnt     = cnt_hi + cnt_lo;

    return  cnt;
}

这是绝对最快的选择吗？

编辑：

另一种选择出现在我脑海中，可能（或不）更快：

#include <nmmintrin.h>
#include <stdint.h>

union   Uint128 {
    __uint128_t uu128;
    uint64_t    uu64[2];
};

static inline   uint_fast8_t    popcnt_u128 (__uint128_t n)
{
    const union Uint128 n_u     = {.uu128   = n};
    const uint_fast8_t  cnt_a   = _mm_popcnt_u64(n_u.uu64[0]);
    const uint_fast8_t  cnt_b   = _mm_popcnt_u64(n_u.uu64[1]);
    const uint_fast8_t  cnt     = cnt_a + cnt_b;

    return  cnt;
}

这样，虽然我不知道它是否合法（是吗？（编辑：它是：Type punning between integer and array using `union`?）），我会避免这种转变。

Answer 1

使用GCC和clang，如果删除static inline，你的函数都会编译为相同的asm，并且可能会等效地内联。

我建议使用unsigned，因为x86-64 Linux上的sizeof(uint_fast8_t) = 1。 _fast类型提出“为什么目的快”的问题; fast8适用于数组中的紧凑存储，fast32是64位类型，它可以避免重做符号或零扩展指针数学，但浪费数组中的空间。

clang知道两个popcnt结果的总和在没有溢出的情况下适合8位整数，所以它可以优化掉零扩展，即使你将结果加到unsigned计数器中，但gcc没有。（例如，将返回类型更改为unsigned，您将获得额外的movzx eax, dil指令。）硬件popcnt指令产生的结果正确地零扩展到64位，但是分配给uint_fast8_t aka uint8_t明确要求编译器截断结果到8位。

x86-64 System V ABI允许args中的高垃圾和返回值，因此当返回类型很窄时，该函数的独立版本可以允许进入高位EAX。

我会避免这种转变。

这种转变只存在于C源中。在asm中，高/低半部分将存储在单独的64位寄存器或单独的存储器源操作数中。

来自the Godbolt compiler explorer

# gcc8.3 -O3 -march=haswell  for the union and the shift version
popcnt_u128:
    xor     eax, eax    # break popcnt's false dependency on Intel CPUs
    popcnt  rsi, rsi    # _mm_popcnt_u64(n_hi);
    popcnt  rax, rdi    # popcnt(lo)
    add     eax, esi        # clang uses add al,cl  and doesn't avoid false deps except in a loop
    ret                 # return value in AL (low 8 bits of EAX)

GCC可以通过使用两个popcnts并使用lea eax, [rdi + rsi]来避免xor-zeroing。但是你说了一个关于数组的东西，所以如果数据来自内存，那么GCC通常会移动加载然后popcnt到位以避免错误依赖。（Why does breaking the "output dependency" of LZCNT matter?）或者实际上，它将xor-zero目标，然后使用memory-source popcnt，这可能是稍小的代码大小。

我不相信__builtin_popcountll因为它使用long long而不是uint64_t。我认为创建一个处理位的函数并使用不是固定宽度的类型是疯狂的。我不知道海湾合作委员会的人在考虑什么。

它实际上使用unsigned long long，而不是签署long long;那会很疯狂。

unsigned long long至少为64位，uint64_t必须精确为64位。（实际上只存在于C实现上，其类型恰好是64位而没有填充;支持它是可选的）。我不确定GNU C是否支持unsigned long long不是64位的任何目标，或者uint64_t不可用的目标。甚至int64_t，也需要2的补充。（如果GCC支持任何非2的补充目标，则为IDK。）

您可以将输入转换为uint64_t以确保没有设置更高的位。即使在uint64_t宽于64位的平台上，从unsigned long long到ULL的隐式转换也不会设置任何额外的位。

例如__builtin_popcountll( (uint64_t)n );将始终安全地计算n的低64位，无论unsigned long long的宽度如何。

我正在使用一个非常大的静态数组。我是否需要关心缓存，或者GCC是否会为我处理？我认为这只是malloc和那些东西的问题。 GCC在编译时知道数组，所以它可以做得比我好。

GCC（几乎？）永远不会重新安排你的循环来改变内存访问模式。静态数组与malloced内存没有本质的区别;他们不会免费在缓存中保持热度。请参阅What Every Programmer Should Know About Memory?以了解更多信息。

但是如果你只是顺序循环通过内存并弹出整个数组，那么无论你是否使用__uint128_t都没关系。

clang将使用AVX2 __builtin_popcntll（作为半字节LUT）在阵列上自动矢量化_mm_popcnt_u64或vpshufb，这对包括Broadwell在内的Intel CPU很有用。见Counting 1 bits (population count) on large data using AVX-512 or AVX-2

但不幸的是，使用你的包装函数为__uint128_t数组失败了。请参阅Godbolt链接中的最后两个功能。

`__uint128_t`上效率最高的popcount？

问题描述投票：6回答：1

1个回答

最新问题

`__uint128_t`上效率最高的popcount？

问题描述 投票：6回答：1

1个回答

最新问题

问题描述投票：6回答：1