.L_done_with_block_copy:
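tst r2, #(1<<4) @ bit 4 set: 16 bytes to copy
ldmdbne r1!, {r3, r12} @ first 8 of the 16 bytes
stmdbne r0!, {r3, r12}
ldmdbne r1!, {r3, r12} @ second 8 of the 16 bytes
stmdbne r0!, {r3, r12}
lsls r3, r2, #29 @ C = bit 3 (8 bytes), N = bit 2 (4 bytes)
ldmdbcs r1!, {r3, r12} @ copy 8 bytes if bit 3 was set
stmdbcs r0!, {r3, r12}
ldrmi r3, [r1, #-4] @ copy the last word if bit 2 was set
strmi r3, [r0, #-4]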
bx lr
.L_block_copy_sub:
push {r4-r9}
1:
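subs r2, r2, #32 @ cs while at least 32 bytes were left
ldmdbcs r1!, {r3-r9, r12} @ copy 8 words, descending
stmdbcs r0!, {r3-r9, r12}
bgt 1b @ loop while the remaining count is positive
pop {r4-r9} @ pop leaves the flags from subs intact
bxeq lr @ count hit exactly 0: nothing left to copy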
adds r2, r2, #32 @@@@@@@@ THIS CAN GO AWAY
b .L_done_with_block_copy
Our newer-style "fewer than 8 words" tail code (currently used only in the reverse loop) tests the low bits of r2 directly. Adding or subtracting 32 never changes bits 0-4, so those tests come out the same with or without the +32, and the adds can be dropped. The same reasoning lets us skip the +4 when single-word copying overshoots 0 and an extra halfword or byte still has to be copied.
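The claim about the low bits can be checked with a small C model. This is only a sketch: `tail_bytes` and `count_tail_mismatches` are hypothetical names, and the model assumes the tail tests exactly bits 4, 3 and 2 of r2 as in the listing above. It compares the tail's decisions for the undershot count (after the last `subs`) against the restored count (after the `adds`).

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical model of .L_done_with_block_copy: returns how many bytes
   the tail would copy for a given value of r2. */
unsigned tail_bytes(uint32_t r2)
{
    unsigned n = 0;
    if (r2 & (1u << 4))            /* tst r2, #(1<<4) */
        n += 16;                   /* two 8-byte ldmdb/stmdb pairs */

    uint32_t shifted = r2 << 29;   /* lsls r3, r2, #29 */
    unsigned carry = (r2 >> 3) & 1u;   /* last bit shifted out is bit 3 */
    unsigned negative = shifted >> 31; /* bit 2 lands in bit 31 (N flag) */

    if (carry)
        n += 8;                    /* ldmdbcs / stmdbcs */
    if (negative)
        n += 4;                    /* ldrmi / strmi */
    return n;
}

/* Counts remainders for which the tail would behave differently when the
   "adds r2, r2, #32" is left out. */
unsigned count_tail_mismatches(void)
{
    unsigned mismatches = 0;
    for (uint32_t rem = 0; rem < 32; rem++) {
        uint32_t undershot = rem - 32u;      /* r2 after the final subs (wraps mod 2^32) */
        uint32_t restored = undershot + 32u; /* r2 after the adds */
        assert(restored == rem);
        if (tail_bytes(undershot) != tail_bytes(restored))
            mismatches++;
    }
    return mismatches;
}
```

Calling `count_tail_mismatches()` returns 0 under this model: subtracting 32 only touches bit 5 and above, so the bits the tail inspects are identical either way.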