How can I optimize a filter loop for the Cortex-M3?

Question

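I just need to change the code so that it performs the same basic function but is more optimized. I think the filter loop is the main piece of code that could be changed, because it seems to have too many instructions, but I don't know where to start. I am using a Cortex-M3 and Thumb-2.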

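I have tried tampering with the filter loop so that I could add the previous number stored in a register and divide by 8, but I don't know how to actually do it.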

; Perform in-place filtering of data supplied in memory
; the filter to be applied is a non-recursive filter of the form
; y[0] = x[-2]/8 + x[-1]/8 + x[0]/4 + x[1]/8 + x[2]/8

  ; set up the exception addresses
  THUMB
  AREA RESET, CODE, READONLY
  EXPORT __Vectors
  EXPORT Reset_Handler
__Vectors 
  DCD 0x00180000     ; top of the stack 
  DCD Reset_Handler  ; reset vector - where the program starts

num_words EQU (end_source-source)/4  ; number of input values
filter_length EQU 5  ; number of filter taps (values)

  AREA 2a_Code, CODE, READONLY
Reset_Handler
  ENTRY
  ; set up the filter parameters
  LDR r0,=source        ; point to the start of the area of memory holding inputs
  MOV r1,#num_words     ; get the number of input values
  MOV r2,#filter_length ; get the number of filter taps
  LDR r3,=dest          ; point to the start of the area of memory holding outputs

  ; find out how many times the filter needs to be applied
  SUBS r4,r1,r2   ; find the number of applications of the filter needed, less 1
  BMI exit        ; give up if there is insufficient data for any filtering

  ; apply the filter  
filter_loop
  LDMIA r0,{r5-r9}     ; get the next 5 data values to be filtered
  ADD r5,r5,r9         ; sum x[-2] with x[2]
  ADD r6,r6,r8         ; sum x[-1] with x[1]
  ADD r9,r5,r6         ; sum x[-2]+x[2] with x[-1]+x[1]
  ADD r7,r7,r9,LSR #1  ; sum x[0] with (x[-2]+x[2]+x[-1]+x[1])/2
  MOV r7,r7,LSR #2     ; form (x[0] + (x[-2]+x[-1]+x[1]+x[2])/2)/4
  STR r7,[r3],#4       ; save calculated filtered value, move to next output data item
  ADD r0,r0,#4         ; move to start of next 5 input data values
  SUBS r4,r4,#1        ; move on to next set of 5 inputs 
  BPL filter_loop      ; continue until last set of 5 inputs reached

  ; execute an endless loop once done 
exit    
  B exit

  AREA 2a_ROData, DATA, READONLY
source  ; some saw tooth data to filter - should blunt the sharp edges
  DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
  DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
  DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
  DCD 0,10,20,30,40,50,60,70,80,90,100,0,10,20,30,40,50,60,70,80,90,100
end_source

  AREA 2a_RWData, DATA, READWRITE
dest  ; copy to this area of memory
  SPACE end_source-source
end_dest
  END

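I am hoping there is a more efficient way to run the code, whether by reducing the overall code size or by reducing the number of cycles it takes to execute, as long as it does the same thing. Any help would be appreciated.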

Answer 1

For code size, try to use only registers r0..r7, which are the ones available to the short 16-bit encodings.

Also, the flag-setting version of an instruction often has a 16-bit encoding where the non-flag-setting version needs 32 bits. For example:

adds r0, #4 is 16-bit vs. 32-bit add r0, #4
movs r7,r7,LSR #2 is 16-bit vs. 32-bit MOV r7,r7,LSR #2
movs r2,#filter_length is 16-bit vs. 32-bit MOV r2,#filter_length. (Immediates larger than the 16-bit encoding's 8-bit range, 0 to 255, still need a 32-bit Thumb-2 mov.)
stmia r3!, {r5} (with writeback) is 16-bit vs. 32-bit str r7, [r3], #4 (post-increment).

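See the Thumb code-size section of my answer to an earlier question: How do I reduce execution time and number of cycles for a factorial loop? And/or code-size?. Look at the disassembly of your code, find the 32-bit instructions, check why each one is 32-bit, and look for a way to make it 16-bit. This is super-basic Thumb optimization that you can always do.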

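r1 and r2 are not even used inside your loop, and r4 = r1-r2 is an assemble-time constant that you compute at run time with 3 instructions. That is obviously crazy compared to a single movs r4, #num_words - filter_length.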

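If those are supposed to be inputs that are not known at assemble time in your real code (perhaps the same function is sometimes used on different inputs?), then reuse the "dead" registers after computing the loop counter. Taking the pointers in r0 and r3 is a bit clunky, so if you use r2 as the loop counter you have r4-r7 and r1 free, and if you use r1 as the loop counter you have r4-r7 and r2 free.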

I chose to use r1 as the loop counter. Here is the disassembly of my version (built with arm-none-eabi-gcc -g -c -mthumb -mcpu=cortex-m3 arm-filter.S && arm-none-eabi-objdump -drwC arm-filter.o):

@@ Saving code size without any other changes

00000000 <function>:
   0:   480a            ldr     r0, [pc, #40]   ; (2c <exit+0x4>)
   2:   f05f 0158       movs.w  r1, #88 ; 0x58
   6:   2205            movs    r2, #5
   8:   4b09            ldr     r3, [pc, #36]   ; (30 <exit+0x8>)
   a:   1a89            subs    r1, r1, r2
   c:   d40c            bmi.n   28 <exit>

0000000e <filter_loop>:
   e:   e890 00f4       ldmia.w r0, {r2, r4, r5, r6, r7}
  12:   443a            add     r2, r7
  14:   4434            add     r4, r6
  16:   4414            add     r4, r2
  18:   eb15 0554       adds.w  r5, r5, r4, lsr #1
  1c:   08ad            lsrs    r5, r5, #2
  1e:   c320            stmia   r3!, {r5}
  20:   3004            adds    r0, #4
  22:   3901            subs    r1, #1
  24:   d5f3            bpl.n   e <filter_loop>

00000026 <exit>:
  26:   e7fe            b.n     26 <exit>

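The Cortex-M3 does not have NEON, but there is data reuse between outputs. By unrolling, we can definitely reuse load results, as well as some of the "inner" add results. Maybe use a sliding window: subtract the element that is no longer part of the total, and add the new one.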

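But since the middle element is "special", we would have two 2-element windows, one on each side of it, unless we have enough spare bits at the top to add x[0] twice and then right-shift by 3 without overflowing. In that case you don't even need to unroll: just load 1 element, adjust the sliding window, redo the middle element, and store 1 element.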
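To check the arithmetic behind that last idea, here is a minimal C model (deliberately not Cortex-M3 assembly, so it can be run on a host). The names filter_reference and filter_sliding are made up for this sketch. It assumes unsigned 32-bit samples that are small enough for twice the middle sample plus the other four to fit in 32 bits; if that does not hold for your data, the two versions can differ.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model of the original loop: (x[0] + (x[-2]+x[-1]+x[1]+x[2])/2) / 4.
   Returns the number of outputs written (n - 4, or 0 if n < 5). */
size_t filter_reference(const uint32_t *src, size_t n, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    if (n < 5)
        return 0;                       /* same early-out as "BMI exit" */
    for (size_t i = 0; i + 5 <= n; i++) {
        uint32_t sum4 = src[i] + src[i + 4] + src[i + 1] + src[i + 3];
        dst[i] = (src[i + 2] + (sum4 >> 1)) >> 2;
    }
    return n - 4;
}

/* Hypothetical sliding-window variant: keep a running sum of the
   5-sample window, add the middle sample a second time, shift by 3.
   Assumes window + middle does not overflow 32 bits. */
size_t filter_sliding(const uint32_t *src, size_t n, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    if (n < 5)
        return 0;
    size_t count = n - 4;
    uint32_t window = src[0] + src[1] + src[2] + src[3] + src[4];
    for (size_t i = 0; ; i++) {
        dst[i] = (window + src[i + 2]) >> 3;
        if (i + 1 == count)
            break;                      /* last output done, nothing left to slide in */
        window += src[i + 5] - src[i];  /* add the new sample, drop the oldest */
    }
    return count;
}
```

Both functions give the same results as long as nothing overflows, because for unsigned integers shifting right by 1 and then by 2 truncates the same way as a single shift by 3. The model re-reads src[i] and src[i + 2] from memory for simplicity; whether an assembly version keeps them in registers or reloads them is a separate choice.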

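(The first version of this answer was based on a misunderstanding of the code. I may update it later with speed optimizations, but for now I am editing to remove the wrong parts.)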
