lokathor / arm7tdmi_aeabi

0.0 0.0 0.0 1.19 MB

Memory functions for the ARM7TDMI.

Home Page: https://lokathor.github.io/arm7tdmi_aeabi/arm7tdmi_aeabi/

License: Creative Commons Zero v1.0 Universal

Assembly 29.92% Rust 70.08%

arm7tdmi_aeabi's Introduction

Hi, I work on Rust stuff.

arm7tdmi_aeabi's Issues

64-bit math

Certain 64-bit operations are provided as functions:

i64 __aeabi_lmul(i64, i64);
__value_in_regs lldiv_t __aeabi_ldivmod(i64 n, i64 d);
__value_in_regs ulldiv_t __aeabi_uldivmod(u64 n, u64 d);
i64 __aeabi_llsl(i64, i32);
i64 __aeabi_llsr(i64, i32);
i64 __aeabi_lasr(i64, i32);
i32 __aeabi_lcmp(i64, i64);
i32 __aeabi_ulcmp(u64, u64);

i64 __aeabi_ldiv0(i64 return_value);

set up CI testing using cross-rs

divmod is probably slower than needs be

We have code for div that is "as fast as possible", and code for divmod that always runs 32 loop iterations (it works bit by bit every time).

Our div code doesn't compute the remainder directly, but once we have the quotient we can multiply it by the divisor and subtract the product from the numerator to get the remainder. This is probably faster in most cases.
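
A minimal Rust sketch of that idea, not the crate's actual assembly. `fast_udiv` is a hypothetical stand-in for the quotient-only routine, and the unsigned 32-bit case is assumed:

```rust
/// Hypothetical stand-in for the fast quotient-only divide.
/// Like Rust's `/`, it panics when `d` is zero.
fn fast_udiv(n: u32, d: u32) -> u32 {
    n / d
}

/// Quotient and remainder built from a quotient-only divide:
/// one multiply and one subtract after the division.
fn udivmod_via_div(n: u32, d: u32) -> (u32, u32) {
    let q = fast_udiv(n, d);
    // q * d <= n for any non-zero d, so neither step can actually wrap;
    // the wrapping ops just mirror what the CPU instructions do.
    let r = n.wrapping_sub(q.wrapping_mul(d));
    (q, r)
}
```

Whether this beats the bit-by-bit loop depends on how quickly the fast divide exits and on the cost of the multiply, so it would need measuring.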

Unaligned memory access

int __aeabi_uread4(void *address);
int __aeabi_uwrite4(int value, void *address);
long long __aeabi_uread8(void *address);
long long __aeabi_uwrite8(long long value, void *address);
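
For reference, a rough Rust model of what the 4-byte pair is expected to do, assuming the write function returns the value it stored. The names `uread4` and `uwrite4` are illustrative only and are not the exported symbols:

```rust
/// Model of `__aeabi_uread4`: load 32 bits from an address of any alignment.
///
/// Safety: `address` must be valid for reading 4 bytes.
unsafe fn uread4(address: *const u8) -> u32 {
    unsafe { core::ptr::read_unaligned(address as *const u32) }
}

/// Model of `__aeabi_uwrite4`: store 32 bits to an address of any alignment
/// and hand the stored value back.
///
/// Safety: `address` must be valid for writing 4 bytes.
unsafe fn uwrite4(value: u32, address: *mut u8) -> u32 {
    unsafe { core::ptr::write_unaligned(address as *mut u32, value) };
    value
}
```

The 8-byte versions would follow the same shape with `u64`.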

after a block copy i think we can delete the +32

eg:

  .L_done_with_block_copy:
    tst     r2, #(1<<4)
    ldmdbne r1!, {r3, r12}
    stmdbne r0!, {r3, r12}
    ldmdbne r1!, {r3, r12}
    stmdbne r0!, {r3, r12}
    lsls    r3, r2, #29
    ldmdbcs r1!, {r3, r12}
    stmdbcs r0!, {r3, r12}
    ldrmi   r3, [r1, #-4]
    strmi   r3, [r0, #-4]
    bx      lr
  .L_block_copy_sub:
    push    {r4-r9}
  1:
    subs    r2, r2, #32
    ldmdbcs r1!, {r3-r9, r12}
    stmdbcs r0!, {r3-r9, r12}
    bgt     1b
    pop     {r4-r9}
    bxeq    lr
    adds    r2, r2, #32 @@@@@@@@ THIS CAN GO AWAY
    b       .L_done_with_block_copy

Our newer-style "less than 8 words" code (currently used only in the reverse loop) looks directly at the bits, and adding 32 does not change any of the bits it tests. In the same way, we can skip the +4 when single-word copying overshoots 0 and we still have to copy an extra halfword or byte.
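
A quick way to convince yourself, sketched in Rust rather than assembly. `tail_decisions` is a made-up helper that models the flag tests in `.L_done_with_block_copy`, and the check assumes the loop exits with between 1 and 31 bytes left (the 0 case returns early via `bxeq`):

```rust
/// Models the tail code's decisions from the count in r2:
/// (copy 16 bytes, copy 8 bytes, copy 4 bytes).
fn tail_decisions(r2: u32) -> (bool, bool, bool) {
    let copy16 = (r2 & (1 << 4)) != 0; // tst r2, #(1<<4)
    let shifted = r2 << 29; // lsls r3, r2, #29
    let copy8 = (r2 & (1 << 3)) != 0; // carry: bit 3 is the last bit shifted out
    let copy4 = (shifted & (1 << 31)) != 0; // negative: bit 2 lands in bit 31
    (copy16, copy8, copy4)
}

/// True if skipping `adds r2, r2, #32` never changes what the tail code does.
fn plus_32_is_redundant() -> bool {
    (1u32..32).all(|remaining| {
        // Value left in r2 after the final `subs r2, r2, #32` overshoots zero.
        let overshot = remaining.wrapping_sub(32);
        tail_decisions(overshot) == tail_decisions(remaining)
    })
}
```

Subtracting 32 only touches bit 5 and above, so bits 2 through 4 are the same either way.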

make the lib a proc-macro so callers can include the code in a section they like.

something like

put_mem_fns_in!(".iwram.text");

The proc-macro would then expand to the correct global_asm! invocation, with the assembly inlined as one big string literal.
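
To make the expansion concrete, here is a rough sketch of the text such a macro might generate. It is a plain function and not a real proc-macro; `expand_put_mem_fns_in` and the one-line assembly body are hypothetical placeholders:

```rust
/// Builds the source text a hypothetical `put_mem_fns_in!` could expand to:
/// a `global_asm!` call whose string starts with a `.section` directive.
fn expand_put_mem_fns_in(section: &str, asm_body: &str) -> String {
    let asm = format!(".section {}\n{}", section, asm_body);
    // `{:?}` prints the text as an escaped Rust string literal.
    format!("core::arch::global_asm!({:?});", asm)
}
```

A real proc-macro would parse the section name from its input tokens and return a token stream instead of a `String`.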

division

int __aeabi_idiv(int numerator, int denominator);
unsigned __aeabi_uidiv(unsigned numerator, unsigned denominator);

typedef struct { int quot; int rem; } idiv_return;
typedef struct { unsigned quot; unsigned rem; } uidiv_return;

__value_in_regs idiv_return __aeabi_idivmod(int numerator, int denominator);
__value_in_regs uidiv_return __aeabi_uidivmod(unsigned numerator, unsigned denominator);

int __aeabi_idiv0(int return_value);
long long __aeabi_ldiv0(long long return_value);

Unaligned ops should use t32 code

They're very basic and don't use conditional execution for anything, so they can just be Thumb code to save a little space.

memmove poor performance when unaligned and Dest > Src

If the destination pointer is greater than the source pointer

              Dest
  0  1  2  3  4  5  6  7
  Src

and the pointers are also unaligned, then we currently fall back to a conservative reverse byte copy for the entire range.

This performs very poorly.

We could try to detect that there's no actual overlap and switch to a forward copy. However, the caller would probably have called memcpy instead of memmove if there were no overlap, so that's a long shot.
We could reverse copy only the overlapping portion and then forward copy the rest. Depending on the amount of overlap, this could give significant improvements: the less overlap, the bigger the improvement.
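
Both ideas can be sketched on a byte slice in Rust. This is only a model of the control flow, with a made-up `move_up` helper and indices standing in for pointers; the real routine would use the fast forward path where the simple loops are:

```rust
/// Moves `n` bytes within `buf` from index `src` up to index `dest` (`dest > src`).
fn move_up(buf: &mut [u8], src: usize, dest: usize, n: usize) {
    assert!(dest > src && dest + n <= buf.len());
    let gap = dest - src; // distance between the two regions
    if gap >= n {
        // No overlap at all: a forward copy is safe.
        for i in 0..n {
            buf[dest + i] = buf[src + i];
        }
        return;
    }
    // The last `n - gap` bytes (the size of the overlap) go first, in reverse,
    // so no source byte is overwritten before it is read.
    for i in (gap..n).rev() {
        buf[dest + i] = buf[src + i];
    }
    // The first `gap` bytes now land in [dest, dest + gap), which does not
    // overlap [src, src + gap), so they can be copied forward.
    for i in 0..gap {
        buf[dest + i] = buf[src + i];
    }
}
```

With a small overlap almost everything goes through the forward loop, which matches the "less overlap, bigger improvement" expectation.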

non-block forward and backward don't agree on using subs and early return or bit testing

The two may be too close for the choice between subtraction-and-return and bit testing to make a big difference, but the forward and backward loops should at least use the same style of control flow, whichever style we pick.

z__aeabi_memcpy_vram is poor

The z__aeabi_memcpy_vram function should be a private symbol.

What someone would really want is more like z__aeabi_memcpy2, but that's not quite what this is doing.

We should just implement z__aeabi_memcpy2 and z__aeabi_memmove2 for consistency.

proc-macro needs docs
