Vectorize random init and print for BigInt with decimal digit array, with AVX2?

Question

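How can I convert my code to AVX2 and get the same results as before?

Is it possible to use __m256i in the LongNumInit and LongNumPrint functions instead of uint8_t *L, or some similar type of variable?

My knowledge of AVX is quite limited. I have investigated quite a bit, but I still do not understand how to transform my code, so any suggestion or explanation is welcome.

I'm really interested in an AVX2 version of this code.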

void LongNumInit(uint8_t *L, size_t N )
{
  for(size_t i = 0; i < N; ++i){
      L[i] = myRandom()%10;
  }

}
void LongNumPrint( uint8_t *L, size_t N, uint8_t *Name )
{
  printf("%s:", Name);
  for ( size_t i=N; i>0;--i )
  {
    printf("%d", L[i-1]);
  }
  printf("\n");
}
int main (int argc, char **argv)
{
  int i, sum1, sum2, sum3, N=10000, Rep=50;

  seed = 12345;

  // obtain parameters at run time
  if (argc>1) { N    = atoi(argv[1]); }
  if (argc>2) { Rep  = atoi(argv[2]); }

 // Create Long Nums
  unsigned char *V1= (unsigned char*) malloc( N);
  unsigned char *V2= (unsigned char*) malloc( N);
  unsigned char *V3= (unsigned char*) malloc( N);
  unsigned char *V4= (unsigned char*) malloc( N);

  LongNumInit ( V1, N ); LongNumInit ( V2, N ); LongNumInit ( V3, N );

//Print last 32 digits of Long Numbers
  LongNumPrint( V1, 32, "V1" );
  LongNumPrint( V2, 32, "V2" );
  LongNumPrint( V3, 32, "V3" );
  LongNumPrint( V4, 32, "V4" );

  free(V1); free(V2); free(V3); free(V4);
  return 0;
}

The result I get with the initial code is this:

V1:59348245908804493219098067811457
V2:24890422397351614779297691741341
V3:63392771324953818089038280656869
V4:00000000000000000000000000000000

Answer 1

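This is a terrible format for BigInteger in general. See https://codereview.stackexchange.com/a/237764 for a code review of the design flaws in using one decimal digit per byte for BigInteger, and what you could or should do instead.

Also see "Can long integer routines benefit from SSE?" for @Mysticial's notes on ways to store your data that make SIMD for BigInteger math practical, specifically partial-word arithmetic where your temporaries might not be "normalized", letting you do lazy carry handling.

But apparently you're just asking about this code, the random-init and print functions, not how to do math between two numbers in this format.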

LongNumInit
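The code in "What's the fastest way to generate a 1 GB text file containing random digits?" generates space-separated random ASCII decimal digits at about 33 GB/s on a 4GHz Skylake, including the overhead of write() system calls to /dev/null. (This is higher than DRAM bandwidth; cache blocking for 128kiB lets the stores hit in L2 cache. The kernel driver for /dev/null doesn't even read the user-space buffer.)
It could easily be adapted into an AVX2 version of void LongNumInit(uint8_t *L, size_t N ). My answer there uses an AVX2 xorshift128+ PRNG (vectorized with 4 independent PRNGs in the 64-bit elements of a __m256i), like "AVX/SSE version of xorshift128+". That should give similar quality of randomness to your rand() % 10.
It breaks that up into decimal digits via a multiplicative inverse, dividing and taking the remainder modulo 10 with shifts and vpmulhuw, as explained in "Why does GCC use multiplication by a strange number in implementing integer division?". (It actually uses GNU C native vector syntax to let GCC determine the magic constant and emit the multiplies and shifts, for convenient syntax like v16u dig1 = v % ten; and v /= ten;)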
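As a rough scalar illustration of that multiplicative-inverse trick (not taken from the linked answers): for a 16-bit unsigned x, x / 10 equals the high 16 bits of x * 0xCCCD shifted right by 3 more bits, which is what one vpmulhuw plus one shift computes per element. The helper names div10_u16 and mod10_u16 are made up for this sketch.

```c
#include <stdint.h>

// Hypothetical helper: x / 10 without a divide instruction.
// 0xCCCD == ceil(2^19 / 10), so (x * 0xCCCD) >> 19 == x / 10 for every 16-bit x.
static inline uint16_t div10_u16(uint16_t x)
{
    uint32_t prod = (uint32_t)x * 0xCCCDu;   // full 32-bit product
    uint16_t high = (uint16_t)(prod >> 16);  // the half that vpmulhuw keeps per element
    return (uint16_t)(high >> 3);            // remaining shift: 19 - 16
}

// Hypothetical helper: x % 10 from the quotient, one multiply and one subtract.
static inline uint16_t mod10_u16(uint16_t x)
{
    return (uint16_t)(x - 10u * div10_u16(x));
}
```

The rounding error of the constant is small enough that the result is exact over the whole 16-bit range, which is why the compiler can use it for every element of a v16u.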
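You can use _mm256_packus_epi16 to pack two vectors of 16-bit digits into 8-bit elements, instead of turning the odd elements into ASCII ' ' and the even elements into ASCII '0'..'9'. (So change vec_store_digit_and_space to pack pairs of vectors instead of ORing with a constant.)
Compile this with gcc, clang, or ICC (or hopefully any other compiler that understands the GNU C dialect of C99 and Intel's intrinsics).
See https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html for the __attribute__((vector_size(32))) part, and https://software.intel.com/sites/landingpage/IntrinsicsGuide/ for the _mm256_* stuff. Also https://stackoverflow.com/tags/sse/info.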
#include <immintrin.h>

// GNU C native vectors let us get the compiler to do stuff like %10 each element
typedef unsigned short v16u __attribute__((vector_size(32)));

// returns p + size of stores.  Caller should use outpos = f(vec, outpos)
// p must be aligned
__m256i* vec_store_digit_and_space(__m256i vec, __m256i *restrict p)
{
    v16u v = (v16u)vec;
    v16u ten = (v16u)_mm256_set1_epi16(10);

    v16u divisor = (v16u)_mm256_set1_epi16(6554);  // ceil((2^16-1) / 10.0)
    v16u div6554 = v / divisor;      // 0..9; basically the entropy from the upper two decimal digits: 0..65.
    // Probably some correlation with the modulo-based values, especially dig3, but we do this instead of
    // dig4 for more ILP and fewer instructions total.

    v16u dig1 = v % ten;
    v /= ten;
    v16u dig2 = v % ten;
    v /= ten;
    v16u dig3 = v % ten;
    //  dig4 would overlap much of the randomness that div6554 gets

    // Pack pairs of 16-bit digit vectors down to one digit per byte.
    // vpackuswb works within 128-bit lanes, so the digits come out interleaved; that is fine for random data.
    // __m256i assignment is an aligned store
    p[0] = _mm256_packus_epi16((__m256i)div6554, (__m256i)dig1);
    p[1] = _mm256_packus_epi16((__m256i)dig2, (__m256i)dig3);
    return p + 2;   // always a constant number of full vectors
}
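The logic in random_decimal_fill_buffer that inserts newlines can be removed entirely, because you just want a flat array of decimal digits. Just call the above function in a loop until you've filled your buffer.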
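A scalar sketch of such a fill loop may help for checking a vectorized version against something simple. It assumes a plain (non-SIMD) xorshift128+ generator with one common choice of shift constants (23, 18, 5), and it mirrors the digit extraction of vec_store_digit_and_space on one 16-bit value at a time. The names xs128p_state, xs128p_next and LongNumInit_scalar are invented here, and the digit order will not match what the AVX2 code produces.

```c
#include <stddef.h>
#include <stdint.h>

// Hypothetical scalar xorshift128+ state; must not be all zero.
typedef struct { uint64_t s[2]; } xs128p_state;

static inline uint64_t xs128p_next(xs128p_state *st)
{
    uint64_t s1 = st->s[0];
    const uint64_t s0 = st->s[1];
    st->s[0] = s0;
    s1 ^= s1 << 23;
    st->s[1] = s1 ^ s0 ^ (s1 >> 18) ^ (s0 >> 5);
    return st->s[1] + s0;
}

// Fill L[0..N-1] with digits 0..9, four digits per 16-bit slice of PRNG output.
void LongNumInit_scalar(uint8_t *L, size_t N, xs128p_state *st)
{
    size_t i = 0;
    while (i < N) {
        uint64_t r = xs128p_next(st);
        for (int k = 0; k < 4 && i < N; ++k, r >>= 16) {
            uint16_t v = (uint16_t)r;
            uint8_t d[4];
            d[0] = (uint8_t)(v / 6554);      // 0..9, driven by the high bits
            d[1] = (uint8_t)(v % 10);
            d[2] = (uint8_t)(v / 10 % 10);
            d[3] = (uint8_t)(v / 100 % 10);
            for (int j = 0; j < 4 && i < N; ++j)
                L[i++] = d[j];               // stop early when the buffer is full
        }
    }
}
```

Because the scalar loop stops exactly at N, it needs no padding, which makes it a convenient baseline for timing as well.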
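Handling small sizes (less than a full vector):
It would be convenient to pad your malloc up to the next multiple of 32 bytes, so it's always safe to do a 32-byte load without checking whether it might cross into an unmapped page.
And use C11 aligned_alloc to get 32-byte aligned storage, for example aligned_alloc(32, (size+31) & -32). This lets us do only full 32-byte stores even if N is odd. Logically, only the first N bytes of the buffer hold our real data, but it's convenient to have padding we can scribble over, to avoid any extra conditional checks for N being less than 32 or not a multiple of 32.
Unfortunately, ISO C and glibc are missing aligned_realloc and aligned_calloc. MSVC does actually provide those (see "Why is there no 'aligned_realloc' on most platforms?"), which sometimes lets you allocate more space at the end of an aligned buffer without copying it. A "try_realloc" would be ideal for C++, which might need to run copy constructors if non-trivially-copyable objects change address. Non-expressive allocator APIs that force sometimes-unnecessary copying are a pet peeve of mine.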
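A minimal sketch of that padded allocation, assuming a C11 library that provides aligned_alloc (glibc does). The helper names padded_size and alloc_digits are hypothetical, and zeroing the buffer is an extra choice made here so that full-vector loads of the padding read defined bytes.

```c
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

// Round n up to the next multiple of 32; same result as (n + 31) & -32.
static inline size_t padded_size(size_t n)
{
    return (n + 31) & ~(size_t)31;
}

// Allocate a 32-byte-aligned digit buffer with room for whole 32-byte vectors.
// Returns NULL on failure; release with free().
uint8_t *alloc_digits(size_t n)
{
    size_t bytes = padded_size(n);
    if (bytes == 0)
        bytes = 32;                 // keep at least one full vector
    uint8_t *p = aligned_alloc(32, bytes);
    if (p)
        memset(p, 0, bytes);        // padding bytes start out as digit 0
    return p;
}
```

With this in place of malloc( N), the four buffers in main can be loaded and stored 32 bytes at a time regardless of N.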
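LongNumPrint
Taking a uint8_t *Name arg is bad design. If the caller wants to printf a "something:" string first, they can do that. Your function should just do what printf "%d" does for an int.
Since you're storing your digits in reverse printing order, you'll want to byte-reverse into a tmp buffer and convert the 0..9 byte values to '0'..'9' ASCII character values by ORing with '0'. Then pass that buffer to fwrite.
Specifically, use alignas(32) char tmpbuf[8192]; as a local variable.
You can work in fixed-size chunks (like 1kiB or 8kiB) instead of allocating a potentially huge buffer. You probably still want to go through stdio (instead of calling write() directly and managing your own I/O buffering). With an 8kiB buffer, an efficient fwrite might just pass it on to write() directly instead of doing a memcpy into the stdio buffer. You might want to experiment with tuning this, but keeping the tmp buffer comfortably smaller than half of L1d cache means it's still hot in cache when it's re-read after you wrote it.
Cache blocking makes the loop bounds a lot more complex, but it's worth it for very large N.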
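For checking a SIMD implementation, a plain scalar version of the same chunked scheme can serve as a reference. This is only a sketch: LongNumPrint_scalar and its FILE * parameter are made up here, and the last partial chunk is handled simply by writing fewer bytes.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

// Print num[N-1] .. num[0] as ASCII digits, working in fixed-size chunks.
// Returns the number of bytes written, like fwrite.
size_t LongNumPrint_scalar(const uint8_t *num, size_t N, FILE *out)
{
    enum { CHUNK = 8192 };          // tmp buffer stays well inside L1d
    char tmp[CHUNK];
    size_t written = 0;
    size_t remaining = N;           // digits num[0..remaining-1] are not printed yet

    while (remaining > 0) {
        size_t chunk = remaining < CHUNK ? remaining : CHUNK;
        for (size_t i = 0; i < chunk; ++i)
            tmp[i] = (char)(num[remaining - 1 - i] | '0');  // reverse and convert to ASCII
        size_t w = fwrite(tmp, 1, chunk, out);
        written += w;
        if (w != chunk)
            break;                  // short write: stop and report what went out
        remaining -= chunk;
    }
    return written;
}
```

Running this and a vectorized version on the same digits, with N values just below and above multiples of 32 and of the chunk size, is a cheap way to catch loop-bound mistakes.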
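Byte-reversing 32 bytes at a time:
You could avoid this work by deciding that your digits are stored in MSD-first order, but then if you did want to implement addition, it would have to loop from the end backwards.
Your function could be implemented with SIMD _mm_shuffle_epi8 to reverse 16-byte chunks, starting from the end of your digit array and writing to the beginning of your tmp buffer.
Or better, use vmovdqu / vinserti128 16-byte loads to feed _mm256_shuffle_epi8 to byte-reverse within lanes, setting up for 32-byte stores.
On Intel CPUs, vinserti128 decodes to a load+ALU uop, but it can run on any vector ALU port, not just the shuffle port. So two 128-bit loads are more efficient than 256-bit load -> vpshufb -> vpermq, which would probably bottleneck on shuffle-port throughput if the data is hot in cache. Intel CPUs can do up to 2 loads + 1 store per clock cycle (or, in IceLake, 2 loads + 2 stores). We'll probably bottleneck on the front-end if there are no memory bottlenecks, so in practice we won't saturate the load+store and shuffle ports. (https://agner.org/optimize/ and https://uops.info/)
This function is also simplified by the assumption that we can always read 32 bytes from L without crossing into an unmapped page. But after a 32-byte reverse for small N, the first N bytes of the input become the last N bytes in a 32-byte chunk. It would be most convenient if we could always safely do a 32-byte load ending at the end of a buffer, but it's unreasonable to expect padding before the object.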
#include <immintrin.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdio.h>
#include <stdint.h>

// one vector of 32 bytes of digits, reversed and converted to ASCII
static inline
void ASCIIrev32B(void *dst, const void *src)
{
    __m128i hi = _mm_loadu_si128(1 + (const __m128i*)src);  // unaligned loads
    __m128i lo = _mm_loadu_si128((const __m128i*)src);
    __m256i v = _mm256_set_m128i(lo, hi);    // reverse 128-bit hi/lo halves

    // compilers will hoist constants out of inline functions
    __m128i byterev_lane = _mm_set_epi8(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m256i byterev = _mm256_broadcastsi128_si256(byterev_lane);  // same in each lane

    v = _mm256_shuffle_epi8(v, byterev);               // in-lane reverse
    v = _mm256_or_si256(v, _mm256_set1_epi8('0'));     // digits to ASCII
    _mm256_storeu_si256((__m256i*)dst, v);             // Will usually be aligned in practice.
}

// UNTESTED, could be bugs in the loop bounds.
// returns bytes written, like fwrite: 0 on error
size_t LongNumPrint( uint8_t *num, size_t N)
{
    // caller can print a name if it wants

    enum { revbufsize = 8192 };      // 8kiB on the stack should be fine
    alignas(32) char revbuf[revbufsize];

    if (N<32) {
        // TODO: maybe use a smaller revbuf for this case to avoid touching new stack pages
        ASCIIrev32B(revbuf, num);   // the data we want is at the *end* of a 32-byte reverse
        return fwrite(revbuf+32-N, 1, N, stdout);
    }

    size_t bytes_written = 0;
    const uint8_t *inp = num+N;  // start with last 32 bytes of num[]
    do {
        size_t chunksize = (inp - num >= revbufsize) ? revbufsize : (size_t)(inp - num);
        if (chunksize < 32) {
            // short remainder after full chunks: N >= 32 here, so num[0..31] is all real data
            ASCIIrev32B(revbuf, num);
            bytes_written += fwrite(revbuf+32-chunksize, 1, chunksize, stdout);
            break;
        }

        const uint8_t *inp_stop = inp - chunksize + 32;   // leave one full vector for the end
        char *outp = revbuf;
        while (inp > inp_stop) {        // may run 0 times
            inp -= 32;
            ASCIIrev32B(outp, inp);
            outp += 32;
        }
        // reverse first (lowest address) 32 bytes of this chunk of num
        // into last 32 bytes of this chunk of revbuf
        // if chunksize%32 != 0 this will overlap, which is fine.
        ASCIIrev32B(revbuf + chunksize - 32, inp_stop - 32);
        bytes_written += fwrite(revbuf, 1, chunksize, stdout);
        inp = inp_stop - 32;            // this chunk is done; continue with the digits below it
    } while ( inp > num );

    return bytes_written;
    // caller can putchar('\n') if it wants
}

This compiles (on Godbolt), but is untested. I wouldn't be surprised if I got some of the pointer math wrong, but the general idea of chunksize = min(ptrdiff, 8k) and using that to loop downwards from the end of num[] should be solid. Note that decrementing inp past the start of num[] would technically be C UB, but I think GCC assumes a flat memory model and de facto defines the behaviour of pointer comparisons well enough. Normal OSes reserve the zero page, so num definitely can't be within 32 bytes of the start of physical memory (so inp can't wrap to a high address).
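We could load (but not store) aligned vectors if we converted the first N%32 bytes and passed them to fwrite before starting the main loop. But that probably leads either to an extra write() system call or to clunky copying inside stdio. (Unless there is already buffered text that has not been printed yet, in which case we already pay that penalty.)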
