Reply by Martin, February 16, 2009
Hi all

My goal was to port an assembly implementation that was running
on an ARM chip to the PowerPC 405 architecture on a Xilinx
Virtex-II Pro. It is working now, but for some reason the PowerPC
implementation is more than 10 times slower than the ARM
implementation. I can only think of 3 reasons:

1) In the ARM implementation I save 16 registers on the stack, whereas
I save 32 registers when working with the
PowerPC.

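 // save registers and set up the stack frame
 mflr  0               // copy the link register into r0
 stw   0, 4(1)         // store it at 4(r1), before r1 is moved
 addi  1, 1, -124      // allocate a 124-byte stack frame
 stmw  3, 8(1)         // store r3..r31 starting at 8(r1)
  // do some stuff
 // restore registers and destroy the stack frame
 lmw   3, 8(1)         // reload r3..r31
 addi  1, 1, 124       // release the stack frame
 lwz   0, 4(1)         // reload the saved link register value
 mtlr  0               // move it back into the link register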

This is how I set up and tear down stack frames in routine calls.
It shouldn't really have a bad impact on performance, should it?
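
As a rough check of the frame above, here is a minimal C sketch of the arithmetic. The helper names (stmw_register_count, frame_bytes) are made up for illustration. It relies only on the fact that stmw and lmw always run from the named register up to r31, so "stmw 3" moves 29 registers rather than 32, and the save area starting at offset 8 ends exactly at the 124-byte frame boundary.

```c
#include <assert.h>

/* Bytes per 32-bit general-purpose register on the PowerPC 405. */
enum { WORD_BYTES = 4 };

/* Number of registers moved by "stmw rS, d(rA)" or "lmw rD, d(rA)":
 * the transfer always runs from the named register up to r31. */
static int stmw_register_count(int first_reg)
{
    assert(first_reg >= 0 && first_reg <= 31);
    return 32 - first_reg;
}

/* Frame size needed when the register save area starts save_offset
 * bytes above the new stack pointer (hypothetical helper). */
static int frame_bytes(int first_reg, int save_offset)
{
    return save_offset + stmw_register_count(first_reg) * WORD_BYTES;
}
```

With first_reg = 3 and save_offset = 8 this gives 29 registers and a 124-byte frame, which matches the addi constants in the snippet. Each call therefore costs 29 word stores plus 29 word loads; how many cycles that takes depends on the memory the stack lives in and on whether the caches are enabled, neither of which is stated here.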

2) The multiplier in the PowerPC architecture. The ARM multiplier has
a latency of 5 instructions, but if I am not completely wrong the
multiplier in the PowerPC 405 also has a latency of 5 clock cycles,
so this should not be the issue?

3) Clock frequency: I used the EDK Base System Builder, where the
external clock is 24 MHz, and I told the tool that I also want the
PowerPC to run at this clock frequency. This seems to be the most
likely source of the problem, which I will have to check now.
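
To see how much of the gap the clock alone could explain, here is a small C sketch. The function name clock_slowdown is hypothetical, and the ARM clock frequency is not given in this post, so any value passed for it is an assumption. It only compares clock rates and ignores differences in cycles per instruction, caches and memory speed.

```c
#include <assert.h>

/* Slowdown expected from clock frequency alone when the same work
 * moves from a core clocked at reference_hz to one at target_hz.
 * Assumes (as a simplification) equal cycle counts on both cores. */
static double clock_slowdown(double reference_hz, double target_hz)
{
    assert(reference_hz > 0.0 && target_hz > 0.0);
    return reference_hz / target_hz;
}
```

For example, if the ARM chip ran at 240 MHz (an assumed figure), clock_slowdown(240e6, 24e6) would be 10, so a PowerPC clocked at 24 MHz could account for a factor of that size by itself.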

However, if anybody has another idea why the PowerPC implementation
is so slow, I would be thankful for some hints.

Thanks!