How to optimize a block copy with right shift + saturate to max = 5 on Cortex-M3

Question

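Basically, I need to make this code more efficient, either by reducing its overall size to cut memory use or by making it run faster. I am using Thumb-2 on a Cortex-M3.

I have tried reducing the number of MOV instructions. That does shrink the code, but because of the way the code works, each individual section has to process its item and keep the result in a register, so I am a bit stuck on how to improve it. The code below is currently in its default state.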

THUMB
  AREA RESET, CODE, READONLY
  EXPORT  __Vectors
  EXPORT Reset_Handler
__Vectors 
  DCD 0x00180000     ; top of the stack 
  DCD Reset_Handler  ; reset vector - where the program starts

  AREA 2a_Code, CODE, READONLY
Reset_Handler
  ENTRY

num_words EQU (end_source-source)/4  ; number of words to copy

start   
  LDR r0,=source     ; point to the start of the area of memory to copy from
  LDR r1,=dest       ; point to the start of the area of memory to copy to
  MOV r2,#num_words  ; get the number of words to copy

  ; find out how many blocks of 8 words need to be copied - it is assumed
  ; that it is faster to load 8 data items at a time, rather than load
  ; individually
block
  MOVS r3,r2,LSR #3  ; find the number of blocks of 8 words
  BEQ individ        ; if no blocks to copy, just copy individual words

  ; copy and process blocks of 8 words 
block_loop
  LDMIA r0!,{r5-r12}  ; get 8 words to copy as a block
  MOV r4,r5           ; get first item
  BL data_processing  ; process first item 
  MOV r5,r4           ; keep first item
  MOV r4,r6           ; get second item
  BL data_processing  ; process second item 
  MOV r6,r4           ; keep second item
  MOV r4,r7           ; get third item
  BL data_processing  ; process third item
  MOV r7,r4           ; keep third item  
  MOV r4,r8           ; get fourth item
  BL data_processing  ; process fourth item 
  MOV r8,r4           ; keep fourth item
  MOV r4,r9           ; get fifth item
  BL data_processing  ; process fifth item
  MOV r9,r4           ; keep fifth item  
  MOV r4,r10          ; get sixth item
  BL data_processing  ; process sixth item 
  MOV r10,r4          ; keep sixth item
  MOV r4,r11          ; get seventh item
  BL data_processing  ; process seventh item
  MOV r11,r4          ; keep seventh item 
  MOV r4,r12          ; get eighth item
  BL data_processing  ; process eighth item
  MOV r12,r4          ; keep eighth item  
  STMIA r1!,{r5-r12}  ; copy the 8 words 
  SUBS r3,r3,#1       ; move on to the next block
  BNE block_loop      ; continue until last block reached

  ; there may now be some data items available (fewer than 8)
  ; find out how many of these individual words need to be copied 
individ
  ANDS r3,r2,#7   ; find the number of words that remain to copy individually
  BEQ exit        ; skip individual copying if none remains

  ; copy the excess of words
individ_loop
  LDR r4,[r0],#4      ; get next word to copy
  BL data_processing  ; process the item read
  STR r4,[r1],#4      ; copy the word 
  SUBS r3,r3,#1       ; move on to the next word
  BNE individ_loop    ; continue until the last word reached

  ; languish in an endless loop once all is done
exit    
  B exit

  ; subroutine to scale a value by 0.5 and then saturate values to a maximum of 5 
data_processing
  CMP r4,#10           ; check whether saturation is needed
  BLT divide_by_two    ; if not, just divide by 2
  MOV r4,#5            ; saturate to 5
  BX lr
divide_by_two
  MOV r4,r4,LSR #1     ; perform scaling
  BX lr

  AREA 2a_ROData, DATA, READONLY
source  ; some data to copy
  DCD 1,2,3,4,5,6,7,8,9,10,11,0,4,6,12,15,13,8,5,4,3,2,1,6,23,11,9,10 
end_source

  AREA 2a_RWData, DATA, READWRITE
dest  ; copy to this area of memory
  SPACE end_source-source
end_dest
  END

Basically, I need the code to keep the results in each register while also reducing its size or making it execute faster. Thanks for your help.

Answer 1

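Here is a slightly optimized version of the main loop. Given that you are programming for a Cortex-M3, there is not much room for improvement, since your CPU supports neither superscalar execution nor SIMD processing. The main differences from your code are:

All relevant functions are inlined
The logic has been optimized a little
Useless move instructions have been omitted

This code takes 10 cycles per table entry, plus a few instructions for the initial branch, plus the final branch misprediction.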

        .syntax unified
        .thumb

        @ r0: source
        @ r1: destination
        @ r2: number of words to copy
        @ the number in front of the comment is the number
        @ of cycles needed to execute the instruction
block:  cbz r2, .Lbxlr          @ 2 return if nothing to copy

.Loop:  ldmia r0!, {r3}         @ 2 load one item from source
        cmp r3, #10             @ 1 need to scale?
        ite lt                  @ 1 if r3 < 10:
        lsrlt r3, r3, #1        @ 1 then r3 >>= 1
        movge r3, #5            @ 1 else r3 = 5
        stmia r1!, {r3}         @ 2 store to destination
        subs r2, r2, #1         @ 1 decrement #words
        bne .Loop               @ 1 continue if not done yet

.Lbxlr: bx lr
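
To check any version of the loop against known values, a host-side model of what each iteration computes can be handy. The C sketch below is not part of the original answer; the names `scale_and_saturate` and `block_copy_ref` are made up for illustration, and it assumes 32-bit words. It mirrors the signed compare (`lt`/`ge`) followed by a logical shift (`lsr`).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: models cmp r3,#10 / ite lt / lsrlt / movge.
   The compare is signed, the shift is logical. The cast of a value above
   INT32_MAX is implementation-defined in C; common compilers wrap it to a
   negative number, which matches the signed compare on the target. */
static uint32_t scale_and_saturate(uint32_t x)
{
    if ((int32_t)x < 10)
        return x >> 1;      /* lsrlt r3, r3, #1 */
    return 5;               /* movge r3, #5     */
}

/* Hypothetical model of the one-word-per-iteration loop. */
static void block_copy_ref(const uint32_t *src, uint32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);
    while (n != 0) {        /* cbz r2 on entry, subs + bne afterwards */
        *dst++ = scale_and_saturate(*src++);
        n--;
    }
}
```

Fed the 28 words of the question's `source` table, this model produces 0,1,1,2,2,3,3,4,4,5,5,0,2,3,5,5,5,4,2,2,1,1,0,3,5,5,4,5. One thing the model makes visible: because the compare is signed but the shift is logical, a word with its top bit set counts as less than 10 and is shifted into a large positive value. Whether that matters depends on the input data, which the question does not specify.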

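By unrolling the loop once, you can get this down to 16 cycles per two entries (8 cycles per entry). Note that this almost triples the code length for only a small performance gain.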

        .syntax unified
        .thumb

        @ r0: source
        @ r1: destination
        @ r2: number of words to copy
        @ the number in front of the comment is the number
        @ of cycles needed to execute the instruction

        @ first check if the number of elements is even or odd
        @ leave this out if it's known to be even
block:  tst r2, #1              @ 1 odd number of entries to copy?
        beq .Leven              @ 2 if not, skip the single-item step

        ldmia r0!, {r3}         @ 2 load one item from source
        cmp r3, #10             @ 1 need to scale?
        ite lt                  @ 1 if r3 < 10:
        lsrlt r3, r3, #1        @ 1 then r3 >>= 1
        movge r3, #5            @ 1 else r3 = 5
        stmia r1!, {r3}         @ 2 store to destination
        subs r2, r2, #1         @ 1 decrement #words

        @ check if any elements are left
        @ remove if you know that at least two elements are present
.Leven: cbz r2, .Lbxlr          @ 2 return if no entries left.

.Loop:  ldmia r0!, {r3, r4}     @ 3 load two items from source

        cmp r3, #10             @ 1 need to scale?
        ite lt                  @ 1 if r3 < 10:
        lsrlt r3, r3, #1        @ 1 then r3 >>= 1
        movge r3, #5            @ 1 else r3 = 5

        cmp r4, #10             @ 1 need to scale?
        ite lt                  @ 1 if r4 < 10:
        lsrlt r4, r4, #1        @ 1 then r4 >>= 1
        movge r4, #5            @ 1 else r4 = 5

        stmia r1!, {r3, r4}     @ 3 store to destination
        subs r2, r2, #2         @ 1 decrement #words twice
        bne .Loop               @ 1 continue if not done yet

.Lbxlr: bx lr

By unrolling the loop four times, 7 cycles per element can be achieved, but I think that is overkill.
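
The structure of a four-way unrolled version can be sketched in C as a host-side model. This is an illustration, not the answer's code: `block_copy_unrolled4` and `scale_and_saturate` are hypothetical names, and handling the `n % 4` leftover words up front is one possible choice, modelled on the odd-element step in the listing above. If the cycle annotations above carry over (a load or store multiple of four registers costing 5 cycles), one group would take 5 + 4*4 + 5 + 1 + 1 = 28 cycles, which agrees with the 7 cycles per element quoted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: signed compare against 10, logical shift right by 1,
   otherwise saturate to 5 (same per-word operation as the assembly). */
static uint32_t scale_and_saturate(uint32_t x)
{
    return ((int32_t)x < 10) ? (x >> 1) : 5u;
}

/* Hypothetical model of a loop unrolled four times. */
static void block_copy_unrolled4(const uint32_t *src, uint32_t *dst, size_t n)
{
    assert(src != NULL && dst != NULL);

    /* Process the 0..3 leftover words first, one at a time. */
    size_t head = n & 3;
    n -= head;
    for (; head != 0; head--)
        *dst++ = scale_and_saturate(*src++);

    /* n is now a multiple of four: handle whole groups. */
    while (n != 0) {
        /* Corresponds to a single ldmia of four registers. */
        uint32_t a = src[0], b = src[1], c = src[2], d = src[3];

        /* Four independent cmp / ite / lsrlt / movge sequences,
           followed by a single stmia of four registers. */
        dst[0] = scale_and_saturate(a);
        dst[1] = scale_and_saturate(b);
        dst[2] = scale_and_saturate(c);
        dst[3] = scale_and_saturate(d);

        src += 4;
        dst += 4;
        n -= 4;             /* subs r2, r2, #4 / bne */
    }
}
```

The model only shows the control flow and the expected results; code size and real timing have to be judged on the assembled Thumb-2 code.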

Note that the assembly listings in this answer are in GNU as syntax. Adapting them to your assembler should be trivial.
