Could someone tell me NIOS II/MB performance on this benchmark?

Started by Tommy Thorn April 29, 2008
I'm trying to get a feel for how the performance of my (so far
unoptimized) soft-core stacks up against the established competition,
so it would be a great help if people with convenient access to Nios
II / MicroBlaze respectively would compile and time this little app:
http://radagast.se/othello/endgame.c (It's an Othello endgame solver.
I didn't write it) and tell me the configuration.

In case anyone cares, mine finished this in 100 seconds in this
configuration: 8 KiB I$, 16 KiB D$, 48 MHz clock frequency, async
sram. (My Mac finished this in ~ 0.5 sec :-)

Thanks
Tommy
Hi,

I did a quick test with MicroBlaze.
At 125 MHz with 64 kbyte of local memory, MicroBlaze takes 6.8 s to run
the benchmark.

I added two defines to the program:
#define printf   xil_printf
#define double float
The first define gives a smaller code footprint, since the default printf
is bloated and no floating-point values are printed.
The second define makes the compiler use the MicroBlaze FPU
single-precision floating-point compare and conversion instructions.
Neither define changes the program result, since there are no actual
floating-point calculations, just compares and conversions.
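
To illustrate why the second define is harmless here: float represents every integer up to 2^24 exactly, so for small integer values a float compare or conversion gives the same answer as the double version. A minimal C sketch of that check follows. The helper names and the range of 64 are illustrative assumptions, not taken from endgame.c.

```c
#include <stdbool.h>

/* Hypothetical helper, not from endgame.c: true when comparing v against
   bound and converting v back to int give the same answers in float and
   in double. */
static bool same_in_float_and_double(int v, int bound)
{
    double d = (double)v;
    float  f = (float)v;
    bool same_compare = (d > (double)bound) == (f > (float)bound);
    bool same_convert = (int)d == (int)f;
    return same_compare && same_convert;
}

/* Checks every pair (v, bound) with both values in [-limit, limit]. */
static bool range_is_safe(int limit)
{
    for (int v = -limit; v <= limit; v++)
        for (int bound = -limit; bound <= limit; bound++)
            if (!same_in_float_and_double(v, bound))
                return false;
    return true;
}
```

Calling range_is_safe(64) covers the disc counts possible on a 64-square board. It says nothing about values beyond 2^24, where float starts to round.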

The program actually prints a relatively large number of characters. If
I remove the printf statement that is part of the loop, the program executes
in 6.1 s.
The baud rate affects the execution time if too many prints
are in the timed section.
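
As a rough sketch of why the baud rate matters: if the UART write blocks until each character is sent, the time spent printing is about the number of characters times the bits per character, divided by the baud rate. The function name below is hypothetical, 8N1 framing (10 bits on the wire per character) is an assumption, and the thread states neither the baud rate nor the character count.

```c
/* Hypothetical helper: seconds needed to push n_chars through a UART at
   the given baud rate, assuming 8N1 framing (1 start + 8 data + 1 stop
   bit = 10 bits per character) and a blocking, unbuffered write. */
static double uart_seconds(unsigned long n_chars, unsigned long baud)
{
    const double bits_per_char = 10.0;
    return (double)n_chars * bits_per_char / (double)baud;
}
```

For example, uart_seconds(11520, 115200) gives 1.0 s, so a few thousand characters at a typical console rate can account for a sizeable fraction of a second.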

Göran
Hi,

Actually, using floating-point at all seems unnecessary in this program.
I think this is a legacy of PC programs, where using double (or float) is
not as performance critical as it is on a CPU without an FPU.

I think it's safe to change double to int in the program without any
change in the result.
The program would not run faster on a Mac/PC with this change, but it will
have a drastic effect on your CPU.

Göran
Thanks Göran,

That's very impressive. You are right about the double precision and
the output. With the below patch applied, I now clock in at 42.5 s. Could
you try it again? (I assume your numbers were with floats.)

Using local memory however doesn't make for an apples to apples
comparison as this benchmark is memory heavy and local memory (as
opposed to cache + slow memory) will give MB a large advantage.

Thanks
Tommy
PS: Which FPGA was this on?
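
Since the two cores run at different clock rates, one way to compare the timings quoted so far is to convert them to clock cycles. A small sketch follows. The function names are hypothetical, and the runs differ in memory system, data type, and printing, so the result is only a rough indication.

```c
/* Clock cycles consumed by a run: run time in seconds multiplied by the
   clock rate in Hz (given here in MHz). */
static double run_cycles(double seconds, double clock_mhz)
{
    return seconds * clock_mhz * 1.0e6;
}

/* How many times more cycles run A needed than run B. */
static double cycle_ratio(double sec_a, double mhz_a,
                          double sec_b, double mhz_b)
{
    return run_cycles(sec_a, mhz_a) / run_cycles(sec_b, mhz_b);
}
```

With the numbers from the thread, 100 s at 48 MHz is about 4.8e9 cycles, 42.5 s at 48 MHz is about 2.04e9 cycles, and 6.8 s at 125 MHz is about 0.85e9 cycles, so cycle_ratio(42.5, 48, 6.8, 125) comes out at roughly 2.4.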
Forgot the patch. I'm sure Google Groups will mangle it for me.

Tommy


diff --git a/testcases/demos/smith-weill-gunnar-endgame.c b/testcases/demos/smith-weill-gunnar-endgame.c
index 55f02d5..55a92db 100644
--- a/testcases/demos/smith-weill-gunnar-endgame.c
+++ b/testcases/demos/smith-weill-gunnar-endgame.c
@@ -168,2 +168,4 @@ additional 1.5 or so.

+#define double long
+
 /* #define WINDOWS_TIMING */
@@ -989,3 +991,3 @@ int main( void ){
       }
-      printf("%3d (emp=%2d wc=%2d bc=%2d) %s\n",
+      if (0) printf("%3d (emp=%2d wc=%2d bc=%2d) %s\n",
          val,            emp,wc,bc,            bds[i]         );
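
The `if (0)` in the patch silences the per-position output inside the timed loop. One alternative, sketched here under the assumption that a compile-time switch is acceptable, is to route the output through a small wrapper. BENCH_VERBOSE and report_result are hypothetical names and are not part of endgame.c.

```c
#include <stdio.h>

/* Hypothetical switch, not part of endgame.c: build with -DBENCH_VERBOSE=1
   to get the per-position output back. */
#ifndef BENCH_VERBOSE
#define BENCH_VERBOSE 0
#endif

/* Prints one result line when BENCH_VERBOSE is nonzero. Returns the number
   of characters written, or 0 when output is disabled. */
static int report_result(int val, int emp, int wc, int bc, const char *board)
{
    if (!BENCH_VERBOSE)
        return 0;
    return printf("%3d (emp=%2d wc=%2d bc=%2d) %s\n", val, emp, wc, bc, board);
}
```

Because BENCH_VERBOSE is a compile-time constant, an optimizing compiler can drop the printf call entirely in the quiet build, much like the `if (0)` form.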

Hi Tommy,

It depends on how you want to benchmark: using only the features that your
CPU has (which lacks a large local memory)?
The code footprint when using the optimized printf is around 50 kbyte with data.
Using a processor with an 8 kbyte icache and a 16 kbyte dcache on an application
that is just twice that size doesn't seem to be a valid cache benchmark.
Cache efficiency is more likely to show when there is at least a 10-50x
factor between cache size and code size.
Using a cache also brings the external memory type and the memory
controller into the benchmark numbers, and I guess those are not apples to
apples between you and me.
Using fast async SRAM as the external memory is not the same as using SDRAM.

Yes, my results were with float instead of double. I don't think you
need to set the type to long, since the values seem to be well within a
byte.

I used the board that was connected to my laptop, which is an ML505 (Virtex-5,
slowest speed grade), and I didn't push the clock frequency.

Göran