FPGARelated.com
Forums

LatticeMico32 extremely poor performance without caches

Started by Antti October 2, 2006
Hi

just some results for LatticeMico32:
* no cache
* code and data in Block RAMs

testing with software loop

 sw r0,r0,0x100
 bri -1

this loop executes in 28 system clock cycles!

simulation done with Xilinx ISE built-in simulator ISIM,
using coregen for addsub and block RAM components.

Antti
PS: as far as I can see, Lattice is at the time of writing violating the GPL license,
or does anyone know where to download the GPL-licensed source code of
the LatticeMico32 GNU toolchain!?
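
As a rough sanity check on the 28-cycle figure above, here is a minimal Python sketch that turns it into cycles per instruction. The count of three bus accesses per pass (two instruction fetches plus one store) is an assumption made for illustration and is not stated in the post.

```python
# Figures taken from the post: 28 clocks per pass through a 2-instruction loop.
LOOP_CYCLES = 28
LOOP_INSTRUCTIONS = 2  # sw + bri

# Assumption (not stated in the post): each pass makes three bus accesses,
# i.e. two instruction fetches and one data store.
BUS_ACCESSES_PER_PASS = 3


def average(total_cycles: int, count: int) -> float:
    """Return the average number of clock cycles per counted event."""
    return total_cycles / count


cpi = average(LOOP_CYCLES, LOOP_INSTRUCTIONS)
cycles_per_access = average(LOOP_CYCLES, BUS_ACCESSES_PER_PASS)

print(f"cycles per instruction: {cpi:.1f}")
print(f"average cycles per bus access (if all time is bus time): {cycles_per_access:.1f}")
```

Under these assumptions the loop averages 14 clocks per instruction, far from the single-cycle execution one would expect from on-chip block RAM.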

Hi Antti,

have you any idea why it is that slow? Branch penalty? Or is the write that 
slow? Does the performance improve with caches on? (I have not looked 
closely at Mico32 yet, maybe it is intended to only be used with caches?)

Regarding the GPL: I think it is sufficient if they clearly say that the 
software is GPL-licensed and if they provide you the source code on request. 
So they would only be violating the license if you ask them to provide you 
the source code and they say "No". Once you have the source code, you are 
free to publish it yourself on a web site (I am sure you will ;-)

Thomas

www.entner-electronics.com

"Antti" <Antti.Lukats@xilant.com> wrote in message
news:1159784088.316718.15260@i42g2000cwa.googlegroups.com...
> Hi
>
> just some results for LatticeMico32:
> * no cache
> * code and data in Block RAMs
>
> testing with software loop
>
> sw r0,r0,0x100
> bri -1
>
> this loop executes in 28 system clock cycles!
>
> simulation done with Xilinx ISE built-in simulator ISIM,
> using coregen for addsub and block RAM components.
>
> Antti
> PS as much as I see Lattice is at time of writing violating GPL license
> or does anyone know where to download the GPL licensed source code of
> the LatticeMico32 GNU toolchain !?
> have you any idea why it is that slow? Branch penalty? Or is the write that
> slow? Does the performance improve with caches on? (I have not looked
> closely at Mico32 yet, maybe it is intended to only be used with caches?)

You really do need to enable caches if you want high performance; then you
should be able to get near single-cycle execution (i.e. allowing for branch
penalties, cache refills etc).

Cheers,
Jon
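
To make "near single cycle, considering branch penalties and cache refills" concrete, here is a minimal sketch of the usual textbook effective-CPI model. The function name and every number in the example call are hypothetical and chosen only for illustration; none of them are measured LM32 figures.

```python
def effective_cpi(base_cpi: float,
                  branch_fraction: float,
                  branch_penalty: float,
                  misses_per_instruction: float,
                  refill_penalty: float) -> float:
    """Average cycles per instruction for a simple pipelined, cached CPU.

    base_cpi               -- CPI with no stalls (1.0 for an ideal pipeline)
    branch_fraction        -- fraction of instructions that are taken branches
    branch_penalty         -- extra cycles lost per taken branch
    misses_per_instruction -- cache misses per executed instruction
    refill_penalty         -- extra cycles lost per cache refill
    """
    return (base_cpi
            + branch_fraction * branch_penalty
            + misses_per_instruction * refill_penalty)


# Hypothetical example values, NOT LM32 measurements.
example = effective_cpi(base_cpi=1.0,
                        branch_fraction=0.2,
                        branch_penalty=3,
                        misses_per_instruction=0.01,
                        refill_penalty=30)
print(f"effective CPI: {example:.2f}")
```

The point of the model is only that with a high hit rate the stall terms stay small, which is why the cached configuration can approach one cycle per instruction.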
Jon Beniston wrote:

> > have you any idea why it is that slow? Branch penalty? Or is the write that
> > slow? Does the performance improve with caches on? (I have not looked
> > closely at Mico32 yet, maybe it is intended to only be used with caches?)
>
> You really do need to enable caches if you want high performance, then
> you should be able to get near single cycle execution (i.e. considering
> branch penalties, cache refills etc).
>
> Cheers,
> Jon

Well, if the only memories are on-chip Block RAMs then caches should not
be needed. For the Xilinx MicroBlaze LMB and PPC OCM buses the BRAMs work
like always-hit cache memories.

On LM32 all memories are on the Wishbone bus, making access to a
BRAM-based memory block slower than access to external memory (assuming
a cache hit).

LM32 does fit nicely into small XP devices like the XP3, but only without
cache, so the requirement to always have caches to achieve any normal
clocks-per-instruction ratio seems like a severely limiting factor for LM32.

OpenFire (open-source MicroBlaze clone) would run way faster in Lattice
silicon than the LM32 (if execution from on-chip memory is compared).

Antti
Antti,

> well if the only memories are on chip Block RAMs then caches should not
> be needed, for Xilinx MicroBlaze LMB and PPC OCM buses the BRAMs work
> like always_hit cache memories.
>
> on LM32 all memories are on Wishbone bus making the access to BRAM
> based memory block slower than the access to external memory (assuming
> cache hit).

This is why it is relatively slow.

> LM32 does fit nicely into small XP devices like XP3, but only without
> cache so the requirement to have always caches to achive any normal
> clock-per cycle ration seems like severly limiting factor for LM32

If you look through the RTL, there is some support for this. I'm sure
Lattice will enable it via the GUI in a later version.

Cheers,
Jon
Jon Beniston wrote:

> Antti.
>
> > well if the only memories are on chip Block RAMs then caches should not
> > be needed, for Xilinx MicroBlaze LMB and PPC OCM buses the BRAMs work
> > like always_hit cache memories.
> >
> > on LM32 all memories are on Wishbone bus making the access to BRAM
> > based memory block slower than the access to external memory (assuming
> > cache hit).
>
> This is why it is relatively slow.
>
> > LM32 does fit nicely into small XP devices like XP3, but only without
> > cache so the requirement to have always caches to achive any normal
> > clock-per cycle ration seems like severly limiting factor for LM32
>
> If you look through the RTL, there is some support for this. I'm sure
> Lattice will enable it via the GUI in a latter version.
>
> Cheers,
> Jon

Let's hope the local-memory interface will be made available and
documented; without it (and with no cache) the performance is really bad.

I have it now running in Virtex-4, doing a maximum-speed loop that
increments a register and writes it to GPIO.

Complete program as .COE for Xilinx coregen:

 memory_initialization_radix=16;
 memory_initialization_vector=
 98000000, B8000800, 34210001, 5801E000, E3FFFFFE, 2800E000;

 xor r0,r0,r0
 mv r1,r0
 addi r1,r1,1
 sw (r0+0xE000),r1 ; this is short store to GPIO base
 bi -2

This loop emits 181 kHz on GPIO(0) at a 12 MHz system clock, so for a
100 MHz clock the max IO toggle rate would be about 1.5 MHz :(

Antti
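
The scaling in the post can be reproduced with a short Python sketch. It rests on two assumptions that the post does not state explicitly: that `bi -2` branches back to the `addi`, so the loop body is three instructions, and that GPIO(0) carries bit 0 of the counter, so one full output period takes two loop iterations.

```python
# Measured values from the post.
F_CLK_MEASURED_HZ = 12e6      # system clock during the measurement
F_GPIO0_MEASURED_HZ = 181e3   # frequency observed on GPIO(0)
F_CLK_TARGET_HZ = 100e6       # clock the result is scaled to

# Assumptions (see lead-in): loop body is addi + sw + bi, and GPIO(0)
# is bit 0 of the counter, so it flips once per loop iteration.
LOOP_INSTRUCTIONS = 3
ITERATIONS_PER_GPIO0_PERIOD = 2

cycles_per_period = F_CLK_MEASURED_HZ / F_GPIO0_MEASURED_HZ
cycles_per_iteration = cycles_per_period / ITERATIONS_PER_GPIO0_PERIOD
cpi = cycles_per_iteration / LOOP_INSTRUCTIONS

# The loop timing in clocks does not change, so the output frequency
# scales linearly with the system clock.
scaled_gpio_hz = F_GPIO0_MEASURED_HZ * F_CLK_TARGET_HZ / F_CLK_MEASURED_HZ

print(f"clocks per loop iteration: {cycles_per_iteration:.1f}")
print(f"clocks per instruction:    {cpi:.1f}")
print(f"GPIO(0) at 100 MHz:        {scaled_gpio_hz / 1e6:.2f} MHz")
```

Under these assumptions each iteration costs roughly 33 clocks, or about 11 clocks per instruction, which is in the same range as the 14 clocks per instruction implied by the first measurement.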
>on LM32 all memories are on Wishbone bus making the access to BRAM
>based memory block slower than the access to external memory (assuming
>cache hit).

Can you cheat and build a system with cache and no memory, then
preload the cache with the data you want?

-- 
The suespammers.org mail server is located in California. So are all my
other mailboxes. Please do not send unsolicited bulk e-mail or unsolicited
commercial e-mail to my suespammers.org address or any of my other addresses.
These are my opinions, not necessarily my employer's. I hate spam.

"Hal Murray" <hmurray@suespammers.org> wrote in message
news:F4idnX0B6paUobzYnZ2dnUVZ_vSdnZ2d@megapath.net...
>
>>on LM32 all memories are on Wishbone bus making the access to BRAM
>>based memory block slower than the access to external memory (assuming
>>cache hit).
>
> Can you cheat and build a system with cache and no memory, then
> preload the cache with the data you want?
>
> --

Not sure. This approach works nicely for the Virtex PPC caches, but for
LM32 I guess it needs a deep look into the RTL code to see whether it is
a possible option or not.

Currently I have the caches disabled because they use some function in a
way that is not supported by ISE (and I was too lazy/busy to fix it), and
actually I wanted to have resource usage numbers for a minimal setup
(e.g. no cache system) anyway.

Adding direct memory is of course possible, as all the RTL is available
(and supposedly has at least partial support there), but let's see if
there will be some updates to the LM32 release.

Antti
Hal Murray wrote:
> >on LM32 all memories are on Wishbone bus making the access to BRAM
> >based memory block slower than the access to external memory (assuming
> >cache hit).
>
> Can you cheat and build a system with cache and no memory, then
> preload the cache with the data you want?

As far as I can see, this is not supported via the GUI yet, but it would
of course be possible if you hacked the RTL so that the cache memories and
tags are initialised with the correct data, which should be fairly
straightforward to do.

However, this would not be as efficient as using the instruction ROM and
data RAM that are in the RTL, as you end up wasting resources on the cache
tag RAMs, which aren't needed, and on all of the cache refill logic etc.

Cheers,
Jon
Jon Beniston wrote:

> Hal Murray wrote:
> > >on LM32 all memories are on Wishbone bus making the access to BRAM
> > >based memory block slower than the access to external memory (assuming
> > >cache hit).
> >
> > Can you cheat and build a system with cache and no memory, then
> > preload the cache with the data you want?
>
> As far as I can see, I don't think this is supported via the GUI yet,
> but would of course be possible if you hacked the RTL to change the
> cache memories and tags to be initialised with the correct data, which
> should be fairly straightforward to do.
>
> However, this would not be as efficient as using the instruction ROM
> and data RAM that are in the RTL, as you end up wasting resources on
> memories for the cache tag RAMs which aren't needed, and all of the
> cache refill logic etc..
>
> Cheers,
> Jon

Hm, I guess the LM32_RAM that is included when the JTAG debugger is
enabled is a block RAM connected directly to the processor. As I had JTAG
configured off, this module was also left out, so I only had Wishbone
block RAMs left in the system.

An update is needed anyway for the custom instruction support, so I guess
access to on-chip memories connected directly to the CPU will also be
available then.

Antti

PS: for those who want to play with LatticeMico32 on a Xilinx platform, I
uploaded the ISE Project Navigator project that I used for testing. It is
a rather minimal LM32 system, tested to work in Virtex-4, available at
www.microfpga.com, download area...