Hi everyone,

I'm trying to optimize the footprint of my firmware on the target device and I realize there are a lot of parameters which might be stored in the embedded RAM instead of dedicated registers.

Certainly the RAM access logic will 'eat some space', but lots of flops will be released. Is there any recommendation on how to optimally use embedded resources? [1]

The main reason for this optimization is to free some space to include a function which was added late in the design phase (ouch!).

Thanks a lot,

Al

[1] I know that, put like this, this question is certainly open to a hot discussion! :-)

--
A: Because it fouls the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing on usenet and in e-mail?
embedded RAM vs. registers
Started by ●January 17, 2014
Reply by ●January 17, 2014
alb wrote:
> Hi everyone,
>
> I'm trying to optimize the footprint of my firmware on the target device and I realize there are a lot of parameters which might be stored in the embedded RAM instead of dedicated registers.
>
> Certainly the RAM access logic will 'eat some space', but lots of flops will be released. Is there any recommendation on how to optimally use embedded resources? [1]
>
> The main reason for this optimization is to free some space to include a function which was added late in the design phase (ouch!).
>
> Thanks a lot,
>
> Al
>
> [1] I know that, put like this, this question is certainly open to a hot discussion! :-)

It depends on the device you're targeting. To some extent the tools can make use of embedded RAM without changing your RTL. For example, Xilinx tools allow you to place logic into unused BRAMs, and will automatically infer SRLs where the design allows it.

I've often used BRAM as a "shadow memory" to keep a copy of internal configuration registers for readback. That can eliminate a large mux, at least for all register bits that only change when written. Read-only bits and self-resetting bits would still need a mux, but the overall logic could be reduced vs. a complete mux for all bits.

--
Gabor

P.S. - I find your signature more annoying than top-posting. In my opinion the most annoying thing about usenet (besides the text-only format) is people who think they have been appointed to police the etiquette of other posters.
Reply by ●January 18, 2014
Hi Gabor,

On 1/17/2014 10:53 PM, GaborSzakacs wrote:
[]
>> I'm trying to optimize the footprint of my firmware on the target device and I realize there are a lot of parameters which might be stored in the embedded RAM instead of dedicated registers.
[]
> It depends on the device you're targeting. To some extent the tools can make use of embedded RAM without changing your RTL. For example, Xilinx tools allow you to place logic into unused BRAMs, and will automatically infer SRLs where the design allows it.

Uhm, apparently the Microsemi devices I'm using (IGLOO), together with the toolset (Libero IDE), are not smart enough to take advantage of the local memory, unless I'm inadvertently asking *not* to use it. To be honest, I have not searched deeply for RAM usage on these devices, but the handbook does not provide any clue on 'use of RAM without changing RTL'.

> I've often used BRAM as a "shadow memory" to keep a copy of internal configuration registers for readback. That can eliminate a large mux, at least for all register bits that only change when written. Read-only bits and self-resetting bits would still need a mux, but the overall logic could be reduced vs. a complete mux for all bits.

I guess I do not completely follow you here; which mux are you talking about?

Al

p.s.: you are entitled to have your own opinion about Usenet and its users' opinion, no more than I am.
Reply by ●January 18, 2014
On 1/18/2014 5:23 PM, alb wrote:
> Hi Gabor,
>
> On 1/17/2014 10:53 PM, GaborSzakacs wrote:
> []
>>> I'm trying to optimize the footprint of my firmware on the target device and I realize there are a lot of parameters which might be stored in the embedded RAM instead of dedicated registers.
> []
>> It depends on the device you're targeting. To some extent the tools can make use of embedded RAM without changing your RTL. For example, Xilinx tools allow you to place logic into unused BRAMs, and will automatically infer SRLs where the design allows it.
>
> Uhm, apparently the Microsemi devices I'm using (IGLOO), together with the toolset (Libero IDE), are not smart enough to take advantage of the local memory, unless I'm inadvertently asking *not* to use it. To be honest, I have not searched deeply for RAM usage on these devices, but the handbook does not provide any clue on 'use of RAM without changing RTL'.
>
>> I've often used BRAM as a "shadow memory" to keep a copy of internal configuration registers for readback. That can eliminate a large mux, at least for all register bits that only change when written. Read-only bits and self-resetting bits would still need a mux, but the overall logic could be reduced vs. a complete mux for all bits.
>
> I guess I do not completely follow you here; which mux are you talking about?

In a system with a processor (external or embedded) you typically have some form of bus to read and write registers within the FPGA. Normally you need the outputs of these registers all the time, so you can't just implement the whole thing as RAM.

Now if the CPU wants to be able to read back the values it wrote, you need a big readback multiplexer (unless your IGLOO has internal tristate buffers) to select the register you want to read back.

What I do is to have a RAM that keeps a copy of what was written by the CPU. Then the readback mux defaults to the output of this (simple single-port) RAM unless the register is read-only or has some side effects that could change the register's value when it's not being written by the CPU. If you have a design with a whole lot of registers, you can really reduce the size of the readback mux.

Of course you could save even more logic by not having readback for values that only change when written by the CPU. These become "write-only" registers, and the software guy then needs to keep his own "shadow" copy of the values he wrote if he needs to read them back later.

> Al
>
> p.s.: you are entitled to have your own opinion about Usenet and its users' opinion, no more than I am.

Someone said, "Opinions are like a**holes. Everyone has one, and they all stink."

In any case I see you removed your signature from the latest post. ;-)

--
Gabor
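
To make the shadow-RAM readback scheme described in this reply concrete, here is a minimal behavioral sketch in Python (a software model, not RTL). The class name `RegisterBlock`, its method names, the register count and the choice of which addresses count as "volatile" (read-only or self-changing) are all assumptions made for illustration. None of it is taken from Gabor's actual design.

```python
class RegisterBlock:
    """Hypothetical model of a CPU register block with shadow-RAM readback."""

    def __init__(self, num_regs, volatile_addrs):
        # Flip-flop registers: their outputs feed the logic all the time.
        self.flops = [0] * num_regs
        # Simple single-port RAM holding a copy of whatever the CPU wrote.
        self.shadow_ram = [0] * num_regs
        # Addresses that are read-only or can change on their own (assumed).
        self.volatile = frozenset(volatile_addrs)

    def cpu_write(self, addr, value):
        # A CPU write updates the live register and the shadow copy together.
        self.flops[addr] = value
        self.shadow_ram[addr] = value

    def hw_update(self, addr, value):
        # Hardware side effect (status bit, self-clearing bit): flops only.
        if addr not in self.volatile:
            raise ValueError("only volatile registers may change without a CPU write")
        self.flops[addr] = value

    def cpu_read(self, addr):
        # Volatile registers are read through the (small) mux over the flops;
        # every other address defaults to the shadow RAM output.
        if addr in self.volatile:
            return self.flops[addr]
        return self.shadow_ram[addr]

    def readback_mux_inputs(self):
        # One mux input per volatile register plus one for the RAM output.
        return len(self.volatile) + 1


block = RegisterBlock(num_regs=64, volatile_addrs=[0, 1])
block.cpu_write(5, 0xAB)   # ordinary configuration register
block.hw_update(0, 0x03)   # status register changed by the hardware
print(block.cpu_read(5), block.cpu_read(0), block.readback_mux_inputs())
```

In this made-up configuration (64 registers, 2 of them volatile) the readback path selects among 3 sources instead of 64, which is the saving the post is pointing at. The model ignores the one-cycle read latency a synchronous RAM would add.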
Reply by ●January 20, 2014
Hi Gabor,

On 1/19/2014 4:30 AM, Gabor wrote:
[]
>> I guess I do not completely follow you here; which mux are you talking about?
>
> In a system with a processor (external or embedded) you typically have some form of bus to read and write registers within the FPGA. Normally you need the outputs of these registers all the time, so you can't just implement the whole thing as RAM.

I follow you if you talk about 'state registers', which of course are needed to keep the current state of the logic, but there are lots of 'configuration registers' which do not need constant access to their values.

A simple example would be the configuration of a UART: you do not need to know *constantly* that you need a parity bit or two stop bits. This type of 'memory' can go in a RAM. Would you agree?

> Now if the CPU wants to be able to read back the values it wrote, you need a big readback multiplexer (unless your IGLOO has internal tristate buffers) to select the register you want to read back.

Got your point about the multiplexer.

> What I do is to have a RAM that keeps a copy of what was written by the CPU.

I tend to avoid local copies of information since they may not mirror efficiently, leading to multiple sources of 'truth' which eventually may bite you.

How do you guarantee on a cycle-by-cycle basis that the two locations match perfectly? What happens if they differ? If you do not need cycle accuracy, then which location do you rely upon?

> Then the readback mux defaults to the output of this (simple single-port) RAM unless the register is read-only or has some side effects that could change the register's value when it's not being written by the CPU. If you have a design with a whole lot of registers, you can really reduce the size of the readback mux.

I now understand your, indeed valid, point.

> Of course you could save even more logic by not having readback for values that only change when written by the CPU. These become "write-only" registers, and the software guy then needs to keep his own "shadow" copy of the values he wrote if he needs to read them back later.

See my opinion on multiple copies above.

[]
> Someone said, "Opinions are like a**holes. Everyone has one, and they all stink."

See, we are not too far apart with our own personal opinion on 'opinions'.

> In any case I see you removed your signature from the latest post. ;-)

That is done automatically by my mailer when I'm not the OP, so do not get too excited about that ;-)
Reply by ●January 21, 2014
alb wrote:
> Hi Gabor,
>
> On 1/19/2014 4:30 AM, Gabor wrote:
> []
>>> I guess I do not completely follow you here; which mux are you talking about?
>>
>> In a system with a processor (external or embedded) you typically have some form of bus to read and write registers within the FPGA. Normally you need the outputs of these registers all the time, so you can't just implement the whole thing as RAM.
>
> I follow you if you talk about 'state registers', which of course are needed to keep the current state of the logic, but there are lots of 'configuration registers' which do not need constant access to their values.
>
> A simple example would be the configuration of a UART: you do not need to know *constantly* that you need a parity bit or two stop bits. This type of 'memory' can go in a RAM. Would you agree?

Not at all. The UART needs to know how many stop bits and what sort of parity to use whenever it transmits data. That can be completely asynchronous to the CPU data bus. If the UART needed to get this info from RAM, it would need another address port to that RAM. That's a very inefficient use of hardware to avoid storing 2 or 3 bits in a separate register. If you meant that the UART would read the RAM and then keep a local copy, how is this different (in terms of resource usage) from just having the register implemented in flip-flops?

>> Now if the CPU wants to be able to read back the values it wrote, you need a big readback multiplexer (unless your IGLOO has internal tristate buffers) to select the register you want to read back.
>
> Got your point about the multiplexer.
>
>> What I do is to have a RAM that keeps a copy of what was written by the CPU.
>
> I tend to avoid local copies of information since they may not mirror efficiently, leading to multiple sources of 'truth' which eventually may bite you.
>
> How do you guarantee on a cycle-by-cycle basis that the two locations match perfectly? What happens if they differ? If you do not need cycle accuracy, then which location do you rely upon?
>
>> Then the readback mux defaults to the output of this (simple single-port) RAM unless the register is read-only or has some side effects that could change the register's value when it's not being written by the CPU. If you have a design with a whole lot of registers, you can really reduce the size of the readback mux.
>
> I now understand your, indeed valid, point.
>
>> Of course you could save even more logic by not having readback for values that only change when written by the CPU. These become "write-only" registers, and the software guy then needs to keep his own "shadow" copy of the values he wrote if he needs to read them back later.
>
> See my opinion on multiple copies above.

This is indeed an issue whenever you use this technique to save resources. I look at it as a trade-off. In the case of readback for read/write bits that only change when written by the CPU, the only time you would be out of sync is at start-up. In my case I would either make a rule that the software must write every register at least once before it can be read back, or I would program the "RAM" with the initial register values at config time. This works on Xilinx parts, where the configuration bitstream has bits for all BRAM locations. Not all FPGAs can do this, though.

Anyway, I thought this thread was about saving device resources...

--
Gabor
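
The start-up mismatch and the two remedies mentioned in this reply (write every register once before reading back, or preload the RAM at configuration time) can be illustrated with a small Python model. The reset values, the use of `None` to stand for unknown RAM contents, and the function names are hypothetical and chosen only for the example.

```python
# Hypothetical reset values of three configuration registers (address -> value).
RESET_VALUES = {0: 0x03, 1: 0x80, 2: 0x00}


def power_up(preload_ram):
    """Return (flops, shadow_ram) as they might look right after configuration."""
    flops = dict(RESET_VALUES)  # flip-flops come up with their reset values
    if preload_ram:
        # Remedy 2: the RAM is initialized with the same values at config time.
        shadow_ram = dict(RESET_VALUES)
    else:
        # No preload available: contents are unknown, modeled here as None.
        shadow_ram = {addr: None for addr in RESET_VALUES}
    return flops, shadow_ram


def cpu_write(flops, shadow_ram, addr, value):
    """A CPU write updates the register and its shadow copy together."""
    flops[addr] = value
    shadow_ram[addr] = value


def stale_addresses(flops, shadow_ram):
    """Addresses whose readback (from RAM) would not match the real register."""
    return sorted(a for a in flops if shadow_ram[a] != flops[a])


flops, ram = power_up(preload_ram=False)
print(stale_addresses(flops, ram))  # all registers are out of sync at start-up

# Remedy 1: software writes every register once before reading anything back.
for addr, value in RESET_VALUES.items():
    cpu_write(flops, ram, addr, value)
print(stale_addresses(flops, ram))  # nothing is stale after the writes
```

After start-up the two copies can only diverge for registers that change without a CPU write, which is why those registers stay on the flop-based readback path in the scheme discussed above.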
Reply by ●January 21, 2014
GaborSzakacs <gabor@alacron.com> wrote:

(snip)
>>> In a system with a processor (external or embedded) you typically have some form of bus to read and write registers within the FPGA. Normally you need the outputs of these registers all the time, so you can't just implement the whole thing as RAM.

>> I follow you if you talk about 'state registers', which of course are needed to keep the current state of the logic, but there are lots of 'configuration registers' which do not need constant access to their values.

>> A simple example would be the configuration of a UART: you do not need to know *constantly* that you need a parity bit or two stop bits. This type of 'memory' can go in a RAM. Would you agree?

> Not at all. The UART needs to know how many stop bits and what sort of parity to use whenever it transmits data. That can be completely asynchronous to the CPU data bus. If the UART needed to get this info from RAM, it would need another address port to that RAM. That's a very inefficient use of hardware to avoid storing 2 or 3 bits in a separate register. If you meant that the UART would read the RAM and then keep a local copy, how is this different (in terms of resource usage) from just having the register implemented in flip-flops?

If you think of it that way (and sometimes I do), then the microprocessor is the biggest waste of transistors ever invented. A huge number of transistors, now in the billions, to get data into, and out of, an arithmetic-logic unit containing thousands of transistors. Most of the time, a large fraction of the logic isn't doing anything at all!

Consider the old favorite of introductory digital logic laboratory courses, the digital clock. Almost nothing happens most of the time (ignore display multiplexing for now), but once a second the display is updated. In the 1970s, you would build one out of TTL chips. Though the FFs had the ability to switch at MHz rates, here they ran at 1 Hz or less. (Well, divided down from 60 Hz.) Again, the transistors are being wasted, but now in the time domain instead of the spatial domain.

A small MCU, with small, built-in RAM and ROM (maybe external ROM), has plenty of power to run a digital clock. It has many more transistors than the TTL version, and they are used more often than in the TTL version, but the economy of scale of building small MCUs more than makes up for it.

As to the previous question, how to build a UART: if you look inside a terminal server (not that anyone uses them anymore) you find a microprocessor in place of 8 UARTs. A single microprocessor is fast enough to collect the bits from eight incoming serial ports, and drive the bits into eight outgoing ports, along with keeping up the TCP connections to the ethernet port.

I am sure the people who designed and built some of the early computers would think it strange that we now have a loop waiting for the user to type on the keyboard. In the early days, single-task batch processing made more efficient use of the available resources. Not so much later, multitasking allowed one to keep a single CPU busy, though with less efficient use of RAM. (Decreasing cost of RAM vs. CPU.)

With an FPGA, one has the ability to keep a large number of transistors (gates) busy a large fraction of the time, if one has a problem big enough.

-- glen
Reply by ●January 21, 2014
Al,

Most "automatic" conversion of logic from LUTs to RAMs involves using the RAMs like ROMs, preloaded with constant data during configuration. Flash-based FPGAs from Microsemi do not have the ability to preload their BRAMs during "configuration." There is no "configuration" phase at/during startup during which they could automatically be preloaded.

Furthermore, the IGLOO/ProASIC3 series only provide synchronous BRAMs with a clock cycle delay between address in and data out. They can be inferred from RTL, so long as your RTL includes that clock cycle delay.

If you have several identical slow-speed interfaces (e.g. UARTs, SPI, I2C, etc.) that could happily run with an effective clock rate of a fraction of your system clock rate, look at C-slow optimization to reduce utilization. There are a few coding tricks that ease translating a single-channel module into a multi-channel, C-slowed module capable of replacing multiple copies of the original.

Retiming can be combined with C-slowing (the two are very synergistic) to enable the original clock rate to be increased, recovering some of the original per-channel performance.

Repipelining can be combined with C-slowing (also synergistic) to hide original design latency, thus recovering some of the per-channel performance without increasing the system clock rate.

Andy
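
As a rough illustration of the C-slow idea mentioned in this reply, the following Python sketch models a single-channel module with one state register, and a C-slowed version in which that register is replaced by a chain of C registers so that C channels share one copy of the logic in round-robin order. The running-sum function `step`, the class names and the sample data are made up for the example. Real RTL for a UART or SPI channel would be more involved, and this is not presented as Andy's coding technique.

```python
from collections import deque


def step(state, x):
    """Combinational logic of a hypothetical module: an 8-bit running sum."""
    return (state + x) & 0xFF


class SingleChannel:
    """Original design: one state register, one channel."""

    def __init__(self):
        self.state = 0

    def clock(self, x):
        self.state = step(self.state, x)
        return self.state


class CSlowed:
    """Same logic, but the state register is replaced by a chain of C registers."""

    def __init__(self, c):
        self.regs = deque([0] * c)

    def clock(self, x):
        # The oldest register holds the state of the channel whose turn it is.
        state = self.regs.popleft()
        new_state = step(state, x)
        self.regs.append(new_state)
        return new_state


def run_separate(data):
    """One SingleChannel instance per channel (C copies of the logic)."""
    outputs = []
    for samples in data:
        channel = SingleChannel()
        outputs.append([channel.clock(x) for x in samples])
    return outputs


def run_interleaved(data):
    """One CSlowed instance; channel inputs are fed round-robin, one per clock."""
    shared = CSlowed(len(data))
    outputs = [[] for _ in data]
    for t in range(len(data[0])):
        for ch, samples in enumerate(data):
            outputs[ch].append(shared.clock(samples[t]))
    return outputs


data = [[1, 2, 3], [10, 20, 30], [100, 100, 100]]
print(run_interleaved(data) == run_separate(data))
```

Each channel advances only once every C clocks of the shared module, which is why this fits interfaces that can run at a fraction of the system clock rate.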
Reply by ●January 25, 2014
Hi Andy,

my apologies for such a delayed reply.

On 1/22/2014 1:27 AM, jonesandy@comcast.net wrote:
[]
> Most "automatic" conversion of logic from LUTs to RAMs involves using the RAMs like ROMs, preloaded with constant data during configuration. Flash-based FPGAs from Microsemi do not have the ability to preload their BRAMs during "configuration." There is no "configuration" phase at/during startup during which they could automatically be preloaded.

That is quite a good piece of information. So I can stop looking for any 'automatic' mode. Moreover, any RAM-based logic I might take advantage of would need to be configured after power-up via external commands. While this is certainly possible, it requires system modifications, which are not often welcome.

[]
> If you have several identical slow-speed interfaces (e.g. UARTs, SPI, I2C, etc.) that could happily run with an effective clock rate of a fraction of your system clock rate, look at C-slow optimization to reduce utilization. There are a few coding tricks that ease translating a single-channel module into a multi-channel, C-slowed module capable of replacing multiple copies of the original.

Thanks for the hint. The way I understood this is inserting, say, 2 registers for each register, increasing latency without affecting the functionality, but allowing retiming. I haven't understood why they call it /C/-slowing and why /C/-registers...

> Retiming can be combined with C-slowing (the two are very synergistic) to enable the original clock rate to be increased, recovering some of the original per-channel performance.
>
> Repipelining can be combined with C-slowing (also synergistic) to hide original design latency, thus recovering some of the per-channel performance without increasing the system clock rate.

These techniques seem indeed attractive when it comes to speed optimization, but in my specific case I simply wanted to free some core logic to allow other functionality to be added to the design.

Since these 'features' are requested at such a late stage in the design phase, it seems very unlikely that they will not require an architectural change. That's the main reason why I was looking for some 'automatic' feature to take advantage of on-chip resources without the need to change too much RTL.

I guess there's no easy fix for this.

Al
Reply by ●January 25, 2014
alb <alessandro.basili@cern.ch> wrote:

(snip)
> Thanks for the hint. The way I understood this is inserting, say, 2 registers for each register, increasing latency without affecting the functionality, but allowing retiming. I haven't understood why they call it /C/-slowing and why /C/-registers...

I only learned about C-slow a year or two ago, and wasn't sure why it was so different from the pipelining that computer designers did in the 1960s and 1970s.

And yes, I don't know what the C is for.

-- glen





