需要帮助在给定64位汇编指令的C中构造Long循环（long x，int n）函数

Question

我有来自C函数long loop(long x, int n)的以下汇编代码，其中q在%rdi中，n在%esi中的64位机器上。我已经就我认为装配说明正在做的事情写了我的评论。

loop:

    movl   %esi, %ecx // store the value of n in register ecx
    movl   $1, %edx // store the value of 1 in register edx (rdx).initial mask
    movl   $0, %eax //store the value of 0 in register eax (rax). this is initial return value 
    jmp    .L2 

.L3 

    movq   %rdi, %r8 //store the value of x in register r8  
    andq    %rdx, %r8 //store the value of (x & mask) in r8
    orq    %r8, %rax //update the return value rax by (x & mask | [rax] ) 
    salq   %cl, %rdx //update the mask rdx by ( [rdx] << n)

.L2 

    testq  %rdx, %rdx //test mask&mask
    jne    .L3 // if (mask&mask) != 0, jump to L3    
    rep; ret

我有以下C函数需要对应汇编代码：

    long loop(long x, int n){
          long result = _____ ;   
          long mask; 
       // for (mask = ______; mask ________; mask = ______){ // filled in as:
          for (mask = 1;      mask != 0;     mask <<n) {
              result |= ________;
          } 
          return result;
     }

我需要一些帮助填补空白，我不是100％确定组装说明，但我通过评论每条线给我最好的镜头。

Answer 1

你的评论中已经有了很多。

long loop(long x, long n) {
    long result = 0;
    long mask;
    for (mask = 1; mask != 0; mask <<= n) {
        result |= (x & mask);
    }
    return result;
}

因为result是返回值，并且返回值存储在%rax中，所以movl $0, %eax最初将0加载到result中。

在for循环中，%r8保持result的值，就像你在评论中提到的那样，x & mask只是n。

Answer 2

该函数将每个resultth位复制到bts reg,reg。

为了记录，实现充满了错过的优化，特别是如果我们正在调整Sandybridge系列，其中shl %cl只有1 uop，1c延迟，但bts是3 uops。（英特尔P6 / Atom / Silvermont CPU上的BTS也是1 uop）

bts %rdx, %rax在AMD K8 / Bulldozer / Zen上只有2 uops。 BTS reg，reg以与x86整数移位相同的方式掩盖移位计数，因此rax |= 1ULL << (rdx&0x3f)实现了n。即在RAX中设置位xor。

（这段代码显然很容易理解，甚至不使用最着名的x86窥孔优化，and-zeroing，但看看我们如何有效地实现同样的东西很有趣。）

更明显的是，在循环中做n是不好的。相反，我们可以用每个x & maskth位设置一个掩码，然后返回ret。这具有额外的优势，即在条件分支之后使用非rep指令，即使我们关心调整AMD Phenom CPU中的分支预测值，我们也不需要ret前缀作为# x86-64 System V: x in RDI, n in ESI mask_bitstride: # give the function a meaningful name mov $1, %eax # mask = 1 mov %esi, %ecx # unsigned bitidx = n (new tmp var) # the loop always runs at least 1 iteration, so just fall into it .Lloop: # do { bts %rcx, %rax # rax |= 1ULL << bitidx add %esi, %ecx # bitidx += n cmp $63, %ecx # sizeof(long)*CHAR_BIT - 1 jbe .Lloop # }while(bitidx <= maxbit); // unsigned condition and %rdi, %rax # return x & mask ret # not following a JCC so no need for a REP prefix even on K10的填充。（因为它不是条件分支后的第一个字节。）

我们假设shl在0..63范围内，否则C将具有未定义的行为。在这种情况下，此实现与问题中基于班次的实现不同。 n==64版本将0x40 & 0x3f = 0视为无限循环，因为移位计数= mask，所以bitidx += n永远不会改变。这个idx版本将在第一次迭代后退出，因为n=65立即变为> = 63，即超出范围。

一个不太极端的情况是n=0会复制所有位（移位计数为1）;这只会复制低位。

两个版本都为n创建了一个无限循环。我使用了无符号比较，因此负qazxswpoi将立即退出循环。

在Intel Sandybridge系列中，原始内环为7 uops。（mov = 1 +和= 1 +或= 1 + variable-count-shl = 3 +宏融合测试+ jcc = 1）。这将成为前端或SnB / IvB上ALU吞吐量的瓶颈。

我的版本只有3 uops，运行速度可以快两倍。（每个时钟1次迭代。）

需要帮助在给定64位汇编指令的C中构造Long循环（long x，int n）函数

问题描述投票：1回答：2

2个回答

最新问题

需要帮助在给定64位汇编指令的C中构造Long循环（long x，int n）函数

问题描述 投票：1回答：2

2个回答

最新问题

问题描述投票：1回答：2