FPGARelated.com
Forums

Nios performance

Started by Piotr Wyderski February 18, 2005
Hello,

how fast can a Nios processor be if embedded
in a speed grade 6 Cyclone FPGA? What is the
approximate maximum reachable clock frequency?

    Best regards
    Piotr Wyderski

"Piotr Wyderski" <wyderskiREMOVE@ii.uni.wroc.pl> wrote in message 
news:cv63gb$to$1@news.dialog.net.pl...
> Hello,
>
> how fast can a Nios processor be if embedded
> in a speed grade 6 Cyclone FPGA? What is the
> approximate maximum reachable clock frequency?
>
> Best regards
> Piotr Wyderski
If you're asking how fast your application can run, then you'd better test it on an actual Cyclone.

You can get a Nios to run at 140 MHz out of on-chip RAM, but each memory access will cost you at least 5 clocks. Running out of SDRAM can be done at 110 MHz, but memory accesses will set you back 11+ clocks each.

Also, watch out for bit shifts. They can take 1 clock per bit on the Cyclone.

To overcome these limitations we upgraded our application to a Stratix I to get 1-clock bit shifts and wrote a Custom Instruction to bypass the Avalon bus and read memory in 2 clocks. (Writes are always fast.) Now our app cooks.

Ken
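To put the clock counts quoted above into perspective, here is a minimal back-of-the-envelope sketch. It only divides the quoted clock rates by the quoted clocks per access, so the results are upper bounds ("at least 5" and "11+" clocks can only make things slower). The function names are made up for illustration and are not part of any Altera tool.

```python
def accesses_per_second(clock_hz, clocks_per_access):
    """Upper bound on CPU memory accesses per second at a given clock."""
    return clock_hz / clocks_per_access


def shift_clocks(bits, clocks_per_bit=1):
    """Clocks for one shift by `bits` positions when the shifter
    needs `clocks_per_bit` clocks per bit position (the Cyclone case
    described in the post)."""
    return bits * clocks_per_bit


# Figures quoted in the post.
ONCHIP_RATE = accesses_per_second(140_000_000, 5)   # on-chip RAM at 140 MHz
SDRAM_RATE = accesses_per_second(110_000_000, 11)   # SDRAM at 110 MHz

if __name__ == "__main__":
    print(f"on-chip RAM: at most {ONCHIP_RATE / 1e6:.0f} M accesses/s")
    print(f"SDRAM:       at most {SDRAM_RATE / 1e6:.0f} M accesses/s")
    print(f"12-bit shift on Cyclone: {shift_clocks(12)} clocks")
```

So a 140 MHz clock buys at most about 28 million memory accesses per second from on-chip RAM, and a 110 MHz clock at most about 10 million from SDRAM.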
Kenneth Land wrote:

> If you're asking how fast your application can run then you better test it on an actual Cyclone.

Unfortunately, I have no working hardware yet, so I cannot test it. To clarify my requirements: I am going to perform digital signal processing on a data stream. There is no need for rapid control flow branching; everything can be pipelined. 16-20 bit fixed-point arithmetic is enough. I need a FIR block, as fast as possible, and an interface to SDRAM, only capable of performing pseudo-DMA burst transfers (say, one M4K RAM block at a time).

The NIOS CPU will be responsible for quite simple things: synchronization of the hardware "coprocessors" (the FIR block, a TDMA interface to an AC-97 codec, NCO, parallel interface to a USB 2.0 coupler etc.), filter coefficient computation (off-line, can be slow), and CF card data transfer handling. I would like to perform some complex operations using NIOS, e.g. div, sqrt etc., so the CPU should be as fast as possible, but not faster -- there's no need for a Stratix device. :-)

No direct communication between NIOS and SDRAM is necessary; a user-controlled "data cache" in the internal memory banks is enough.

BTW, I'd like to connect a 16Mbit 70ns parallel (1M x 16) FLASH memory chip to the FPGA. Where should I connect it? Can it share the lines with the SDRAM interface (+ a simple address decoder)?

> You can get a Nios to run 140MHz out of onchip ram, but each memory access will cost you at least 5 clocks. Running out of sdram can be had at 110MHz, but memory accesses will set you back 11+ clocks per.

But it is fully pipelined, isn't it?

    Best regards
    Piotr Wyderski
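For readers unfamiliar with the kind of FIR block mentioned in the post above, here is a minimal software model of a fixed-point FIR filter. It is only a sketch under assumptions that the post does not state: 16-bit signed samples, coefficients in Q15 format, a wide accumulator, round-to-nearest and saturation on output. A hardware implementation would be pipelined and would look quite different; `fir_q15` is a hypothetical name.

```python
def fir_q15(samples, coeffs_q15):
    """Direct-form FIR on signed 16-bit samples with Q15 coefficients.

    Assumptions (not from the thread): Q15 coefficients, wide
    accumulator, round-to-nearest, saturation to 16 bits.
    """
    history = [0] * len(coeffs_q15)  # delay line, newest sample first
    out = []
    for x in samples:
        history = [x] + history[:-1]
        # Multiply-accumulate over all taps in a wide accumulator.
        acc = sum(h * c for h, c in zip(history, coeffs_q15))
        y = (acc + (1 << 14)) >> 15      # round, then rescale from Q15
        y = max(-32768, min(32767, y))   # saturate to the 16-bit range
        out.append(y)
    return out


if __name__ == "__main__":
    # Two-tap moving average: both coefficients are 0.5 in Q15.
    print(fir_q15([1000, 1000, 1000], [16384, 16384]))
```

The two-tap example averages each sample with its predecessor, so a constant input of 1000 settles at 1000 after the first output.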
"Piotr Wyderski" <wyderskiREMOVE@ii.uni.wroc.pl> wrote in message 
news:cv68m5$48l$1@news.dialog.net.pl...
> Kenneth Land wrote:
>
>> If you're asking how fast your application can run then you better test
>> it on an actual Cyclone.
>
> Unfortunately, I have no working hardware yet, so I cannot
> test it. To clarify my requirements: I am going to perform
> digital signal processing on a data stream. There is no need
> for rapid control flow branching, everything can be pipelined.
> 16-20 bit fixed-point arithmetic is enough. I need a FIR
> block, as fast as possible, and an interface to SDRAM,
> only capable of performing pseudo-DMA burst transfers
> (say, one M4K RAM block at a time). The NIOS CPU will be
> responsible for quite simple things: synchronization of the
> hardware "coprocessors" (the FIR block, a TDMA interface to
> an AC-97 codec, NCO, parallel interface to a USB2.0 coupler
> etc.), filter coefficient computation (off-line, can be slow),
> CF card data transfer handling. I would like to perform
> some complex operations using NIOS, e.g. div, sqrt etc.,
> so the CPU should be as fast as possible, but not faster
> -- there's no need for a Stratix device. :-)
>
> No direct communication between NIOS and SDRAM is
> necessary, a user-controlled "data cache" in the internal
> memory banks is enough.
>
> BTW, I'd like to connect a 16Mbit 70ns parallel (1M x 16) FLASH
> memory chip to the FPGA. Where should I connect it? Can it
> share the lines with the SDRAM interface (+ a simple address
> decoder)?
>
>> You can get a Nios to run 140MHz out of onchip ram, but each
>> memory access will cost you at least 5 clocks. Running out of
>> sdram can be had at 110MHz, but memory accesses will set you
>> back 11+ clocks per.
>
> But it is fully pipelined, isn't it?
>
> Best regards
> Piotr Wyderski
DMA with the supplied SOPC DMA peripheral is fast. It's latency aware, and we can DMA in and out of it as fast as one clock per word.

The data master of the Nios is not latency aware, and even hand-coded assembly doing back-to-back reads to sequential addresses in SDRAM will be no faster than 11+ clocks per read.

So here's the answer: DMAs and all writes are fast. Avoid non-DMA reads whenever possible.

Originally for my app we DMA'd an external FIFO into SDRAM buffers for processing. We got that transfer down to one clock per 32-bit word, but the 11 clocks to get each word out to process it killed us.

The solution was a custom instruction that can read the FIFO in 2 clocks, process it, then write the result to an SDRAM buffer. (Remember, writes are fast.) Buffers are eventually DMA'd from SDRAM to the USB 2.0 controller.

Hope this helps.

Ken
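The three read paths described above can be compared per buffer with a small sketch. The clocks-per-word figures are the ones quoted in the post (with "11+" taken as exactly 11, so the CPU-read figure is a lower bound); the 110 MHz default clock and the 1024-word buffer are assumptions chosen only for illustration, and the names are hypothetical.

```python
# Clocks per 32-bit word for each path, as quoted in the post.
CLOCKS_PER_WORD = {
    "dma": 1,                 # SOPC DMA peripheral, latency aware
    "cpu_read": 11,           # Nios data master reading SDRAM ("11+")
    "custom_instruction": 2,  # custom instruction reading the FIFO
}


def buffer_clocks(words, path):
    """Clocks needed to move `words` words over the given path."""
    return words * CLOCKS_PER_WORD[path]


def buffer_time_us(words, path, clock_hz=110_000_000):
    """Same cost in microseconds; the 110 MHz default is an assumption."""
    return buffer_clocks(words, path) / clock_hz * 1e6


if __name__ == "__main__":
    for path in CLOCKS_PER_WORD:
        clocks = buffer_clocks(1024, path)
        print(f"{path:20s} {clocks:6d} clocks  {buffer_time_us(1024, path):7.1f} us")
```

For a 1024-word buffer this gives 1024 clocks by DMA, 2048 through the custom instruction and at least 11264 with plain CPU reads, which is why the non-DMA read path dominated the original design.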
Hello Piotr,

With a 1C20, speed grade 6 and Quartus physical synthesis, I achieve 116 MHz
(fast-fit in contrast: 92 MHz); with speed grade 8 (a bit cheaper...) this
drops to 89 MHz (typical design, with SDRAM controller). The real fmax of
course depends on your design, e.g. which peripherals you are using, how full
your chip is, etc.

If you need the CPU only for simple control tasks, you might also
consider using our ERIC5 (www.entner-electronics.com). However, there is
no support for fast multiplications and divisions; it is more comparable to an
Atmel AVR in performance (but with a higher fmax).

Regards,

Thomas

"Piotr Wyderski" <wyderskiREMOVE@ii.uni.wroc.pl> wrote in message
news:cv63gb$to$1@news.dialog.net.pl...
> Hello,
>
> how fast can a Nios processor be if embedded
> in a speed grade 6 Cyclone FPGA? What is the
> approximate maximum reachable clock frequency?
>
> Best regards
> Piotr Wyderski
Kenneth Land wrote:

> The data master of the Nios is not latency aware and even hand assembly coded back to back reads to sequential addresses in sdram will be no faster than 11+ clocks per read.

That's bad. Why hasn't Altera done something about this?

> So here's the answer: dma's, and all writes are fast. Avoid non-dma reads whenever possible.

I'm going to use the internal RAM blocks intensively.

> Hope this helps.
Of course, thanks!

    Best regards
    Piotr Wyderski

"Thomas Entner" <aon.912710880@aon.at> wrote in message
news:42176496$0$33864$91cee783@newsreader01.highway.telekom.at...
> Hello Piotr,
>
> With a 1C20, speed grade 6 and using Quartus physical synthesis, I achieve
> 116 MHz (using fast-fit, in contrast: 92 MHz); with speed grade 8 (a bit
> cheaper...) this drops to 89 MHz (typical design, with SDRAM controller).
> The real fmax of course depends on your design, e.g. which peripherals you
> are using, how full your chip is, etc.
>
> If you need the CPU only for simple control tasks, you might also
> consider using our ERIC5 (www.entner-electronics.com). However, there is
> no support for fast multiplications and divisions; it is more comparable
> to an Atmel AVR in performance (but with a higher fmax).
Hi Thomas,

could you please give the Quartus resource utilization for ERIC5 when targeting the EPM240 and executing from UFM? On your website you claim it would be 50% and that ERIC5 was initially targeted at MAX II. I am just curious to see that report :)

Antti

PS: the two other companies that used to offer 9-bit processor IP cores are now dead and vanished, hope you have better luck!
Hi Antti,

it was very nice to meet you at the Embedded World. I have rechecked my last
MAXII-ERIC version (as mentioned, it is not supported at the moment, but
could be "revitalised"): it needs 120 LCs (50%); when I removed some
debug signals, I even got down to 109 (45%). This includes 9-bit output ports
and 9-bit input ports and NO RAM, just the 3 internal registers. As
mentioned, the missing RAM blocks of MAXII will reduce the usefulness of a
CPU in this "CPLD", as RAM is very expensive, even with your tricks.

The core has changed a bit since the MAXII implementation, so the LC count might
differ slightly (up or down) if we were to redo it. The Cyclone version needs
about 110 LCells; for MAXII we can remove the PC of our core, but need to
add the UFM flash interface.

Even if ERIC5 has no commercial success (how can it, at this pricing ;-),
we will survive, don't worry... Our main business is camera design (where we
use ERIC5 for our own products) and FPGA design.

Regards,

Thomas
www.entner-electronics.com

P.S.: The exact resource usage summary:
Logic cells                      109 / 240 ( 45 % )
Registers                        103 / 240 ( 42 % )
Total LABs                       16 / 24 ( 66 % )
Logic elements in carry chains   14
User inserted logic cells        0
Virtual pins                     0
I/O pins                         20 / 80 ( 25 % )
    -- Clock pins                0
Global signals                   1
UFM blocks                       1 / 1 ( 100 % )
Global clocks                    1 / 4 ( 25 % )
Maximum fan-out node             clk
Maximum fan-out                  104
Total fan-out                    601
Average fan-out                  4.62

"Antti Lukats" <antti@openchip.org> wrote in message
news:cvc4va$2li$04$1@news.t-online.com...
> "Thomas Entner" <aon.912710880@aon.at> wrote in message
> news:42176496$0$33864$91cee783@newsreader01.highway.telekom.at...
>> Hello Piotr,
>>
>> With a 1C20, speed grade 6 and using Quartus physical synthesis, I achieve
>> 116 MHz (using fast-fit, in contrast: 92 MHz); with speed grade 8 (a bit
>> cheaper...) this drops to 89 MHz (typical design, with SDRAM controller).
>> The real fmax of course depends on your design, e.g. which peripherals you
>> are using, how full your chip is, etc.
>>
>> If you need the CPU only for simple control tasks, you might also
>> consider using our ERIC5 (www.entner-electronics.com). However, there is
>> no support for fast multiplications and divisions; it is more comparable
>> to an Atmel AVR in performance (but with a higher fmax).
>
> Hi Thomas,
>
> could you please give the Quartus resource utilization for ERIC5 when
> targeting the EPM240 and executing from UFM?
> On your website you claim it would be 50% and that ERIC5 was
> initially targeted at MAX II. I am just curious to see that report :)
>
> Antti
> PS: the two other companies that used to offer 9-bit processor
> IP cores are now dead and vanished, hope you have better luck!
Thomas Entner wrote:

> Hi Antti,
>
> it was very nice to meet you at the Embedded World. I have rechecked my last
> MAXII-ERIC version (as mentioned, it is not supported at the moment, but
> could be "revitalised"): it needs 120 LCs (50%); when I removed some
> debug signals, I even got down to 109 (45%). This includes 9-bit output ports
> and 9-bit input ports and NO RAM, just the 3 internal registers. As
> mentioned, the missing RAM blocks of MAXII will reduce the usefulness of a
> CPU in this "CPLD", as RAM is very expensive, even with your tricks.

That was one of Altera's big mistakes on MAX II: they did not see the opening for state-ROM and tiny-CPU coding, or Smart-UART areas, and so designed a part with no RAMs and crippled code flash...

> The core has changed a bit since the MAXII implementation, so the LC count might
> differ slightly (up or down) if we were to redo it. The Cyclone version needs
> about 110 LCells; for MAXII we can remove the PC of our core, but need to
> add the UFM flash interface.

Another option to consider would be to add support for serial memory, like Ramtron's FM25x devices. These allow any mix of RAM and code, are variable sized from 4Kb..256Kb, and have 25MHz serial bus speeds.

-jg
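As a rough feel for what a 25 MHz serial bus would mean for fetching code or data, the sketch below divides the bus clock by the number of serial clocks per fetched word. Only the 25 MHz figure comes from the post. The 24 overhead bits for a random access (an 8-bit opcode plus a 16-bit address) are an assumption typical of SPI-style memories and should be checked against the actual datasheet; `serial_fetch_rate` is a hypothetical helper.

```python
def serial_fetch_rate(bus_hz, word_bits, overhead_bits=0):
    """Words per second when every fetch costs
    overhead_bits + word_bits serial clocks."""
    return bus_hz / (overhead_bits + word_bits)


BUS_HZ = 25_000_000  # serial bus speed quoted in the post

# Sequential streaming: command and address sent once, cost ignored.
SEQUENTIAL_BYTES_PER_S = serial_fetch_rate(BUS_HZ, 8)

# Random access: ASSUMED 8-bit opcode + 16-bit address before each byte.
RANDOM_BYTES_PER_S = serial_fetch_rate(BUS_HZ, 8, overhead_bits=24)

if __name__ == "__main__":
    print(f"sequential: {SEQUENTIAL_BYTES_PER_S:,.0f} bytes/s")
    print(f"random:     {RANDOM_BYTES_PER_S:,.0f} bytes/s")
```

Under these assumptions, straight-line code could stream at roughly 3 MB/s, while every jump or random data access would pay the command and address overhead again.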