comp.arch.fpga | ddr with multiple users

Hi,

I have about 4 different independent things that each need to access
a ddr.

On one hand it seems I can make them all wishbone
compliant then just have a wishbone ddr interface.

Would it be workable/advisable to instead just have each device
control the ddr itself, and use the ddr's own interface directly?

I'd only need one complicated mechanism to initialize the ddr
after reset, but from then on each of the user processes can just
request access to the ddr, and when granted just take over
the lines.

One concern is that ddr timing at 100 MHz is pretty tight. Having
the logic to combine 4 different sources into control signals for
the ddr might add too much overhead. Of course it can be
accomplished with a single LUT just doing a non-registered
OR, if all 4 sources know to zero out their control lines when
they're not the master...

Any tips/advice welcome.

Thanks--
Dave


-- 
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture

Reply by Jonathan Bromley ●September 7, 2006

On 7 Sep 2006 21:28:33 +0200, David Ashley
<dash@nowhere.net.dont.email.me> wrote:

>I have about 4 different independent things that each need to access
>a ddr.
>
>On one hand it seems I can make them all wishbone
>compliant then just have a wishbone ddr interface.
>
>Would be workable/advisable to instead just have each device
>control the ddr itself, and use the ddr's own interface directly?

Seems to me that your second idea would involve each device
having a complete DDR access controller in it.  That sounds 
like quite a bad idea to me; if you're going to make good use 
of SDRAM you need to keep track of the RAM's internal state
to some extent (which banks are active, current row address,
that sort of thing) and it would be very difficult for all four
accessors to keep that kind of internal state in step.

Of course, passing each client's requests to the RAM 
controller is sure to cost some latency if you use a common
controller.  If you have a custom controller (rather than a 
standard single-port controller on a common bus) then you
can hide most of that latency in the arbitration delay, at the
expense of some extra complexity.

It's an interesting question, though.  I need to deal with
something quite similar in the immediate future, so any
other ideas would be gratefully received!  Oh, and does
anyone have any strong opinions (positive or negative)
about any of the available open-source DDR controllers?
-- 
Jonathan Bromley, Consultant

DOULOS - Developing Design Know-how
VHDL * Verilog * SystemC * e * Perl * Tcl/Tk * Project Services

Doulos Ltd., 22 Market Place, Ringwood, BH24 1AW, UK
jonathan.bromley@MYCOMPANY.com
http://www.MYCOMPANY.com

The contents of this message may contain personal views which 
are not the views of Doulos Ltd., unless specifically stated.

Reply by David Ashley ●September 7, 2006

Jonathan Bromley wrote:
> Seems to me that your second idea would involve each device
> having a complete DDR access controller in it.  That sounds 
> like quite a bad idea to me; if you're going to make good use 
> of SDRAM you need to keep track of the RAM's internal state
> to some extent (which banks are active, current row address,
> that sort of thing) and it would be very difficult for all four
> accessors to keep that kind of internal state in step.
> 
> Of course, passing each client's requests to the RAM 
> controller is sure to cost some latency if you use a common
> controller.  If you have a custom controller (rather than a 
> standard single-port controller on a common bus) then you
> can hide most of that latency in the arbitration delay, at the
> expense of some extra complexity.
> 
> It's an interesting question, though.  I need to deal with
> something quite similar in the immediate future, so any
> other ideas would be gratefully received!  Oh, and does
> anyone have any strong opinions (positive or negative)
> about any of the available open-source DDR controllers?

No matter what happens, 4 separate widgets need to gain
access to memory. If they have to interface to some other
controller anyway, what's the advantage? Why not make the
DDR itself the controller?

For example wishbone, the idea behind replacing DDR's own
interface with a wishbone in-between would be based on
assertions that
1) DDR is overly complex, wishbone's simpler
2) IP core reuse -- wishbone is more standard

I *need* to be able to burst large blocks of memory to the
DDR. If I use a wishbone interface, then there needs to
be some mechanism to translate a burst from one
clock domain (the wishbone's) to the DDR's clock
domain. That might involve some sort of fifo...and sounds
complicated to me.

On the other hand the DDR controller is actually not that
complicated. Using DDR itself would allow known, easy
burst accesses, and memory bandwidth can be maximized.

Regarding the DDR's internal state, I'm planning on all
widgets doing burst accesses, and each access would only
be to a single row. If each widget just precharged the row
upon exit, the overhead would be minimal, yet bandwidth
would still be good.
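
A rough way to put numbers on the single-row burst plan above is the small Python sketch below. The timing values (`t_rcd`, `cas_latency`, `t_rp`, all in clock cycles) are illustrative assumptions and not taken from any datasheet, and the model ignores refresh and read/write turnaround.

```python
def burst_efficiency(words, t_rcd=2, cas_latency=2, t_rp=2):
    """Fraction of clock cycles that carry data for one activate/burst/precharge.

    words: number of data words in the burst (DDR moves 2 words per clock).
    t_rcd, cas_latency, t_rp: assumed overheads in clock cycles (hypothetical).
    """
    data_cycles = words / 2
    overhead = t_rcd + cas_latency + t_rp
    return data_cycles / (data_cycles + overhead)


# Longer single-row bursts amortize the fixed open/close overhead.
for words in (8, 64, 512):
    print(words, round(burst_efficiency(words), 2))
```

Under these assumed timings an 8-word burst keeps the data bus busy about 40% of the time, while a 512-word burst gets close to 98%, which is why precharging after every burst is cheap only when the bursts are long.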

Finally, if a refresh cycle needs to be imposed, that can be
done with a 5th widget that just does a refresh cycle, or it
could be a function of the arbitrator itself.

-Dave

-- 
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture

Reply by Nico Coesel ●September 7, 2006

David Ashley <dash@nowhere.net.dont.email.me> wrote:

>Hi,
>
>I have about 4 different independent things that each need to access
>a ddr.
>
>On one hand it seems I can make them all wishbone
>compliant then just have a wishbone ddr interface.
>
>Would be workable/advisable to instead just have each device
>control the ddr itself, and use the ddr's own interface directly?

I have created something similar. I created a fifo in a block ram
which is used to source or sink data to or from the ddr memory from
and to multiple devices at different speeds. In my application I need
to write or read large bursts of data so I created a fifo which can
work only one direction at a time. I use interleaved fixed burst sizes
of 8 (16 bits per DQ line in one transaction) so the overhead is
minimal.

You might want to look into some sort of caching scheme anyway because
accessing ddr to read or write just one address is dead slow.

>I'd only need one complicated mechanism to initialize the ddr
>after reset, but from then on each of the user processes can just
>request access to the ddr, and when granted just take over
>the lines.

I created a hook into the ddr statemachine which allows me to execute
any type of 'instruction' on the ddr memory from a microcontroller.
This reduces initialisation to a software thing. Don't forget about
refresh.

>One concern is that ddr timing at 100 mhz is pretty tight. Having
>the logic to combine 4 different sources into control signals for
>the ddr might add too much overhead. Of course it can be

You can run the DDR controller at 50MHz (half the ddr memory clock)
which relaxes the timing for the address and control signals a lot
(also good for meeting EMC limits because the drive strength can be
reduced and the signals carry lower frequencies). The only line that
actually needs tight timing is CS. Fortunately, this is the least
loaded line in large multi-chip memory setups. 

The data lines are a different story. Using a clock with a fixed delay
(in my case 90 degrees is just fine) to capture the data gives more
than enough margin.
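
The arithmetic behind the 90 degree capture clock can be sketched as follows. This assumes the 100 MHz memory clock mentioned in the thread and read data that changes on both clock edges; board and pad delays are ignored, so it is only a first-order estimate.

```python
def capture_offset_ns(clock_mhz, phase_deg=90):
    """Return (clock period, bit time, capture offset) in nanoseconds."""
    period_ns = 1000.0 / clock_mhz               # one full memory clock period
    bit_time_ns = period_ns / 2                  # DDR transfers data on both edges
    offset_ns = period_ns * phase_deg / 360.0    # delay of the shifted capture clock
    return period_ns, bit_time_ns, offset_ns


print(capture_offset_ns(100))  # (10.0, 5.0, 2.5)
```

At 100 MHz each data bit is valid for roughly 5 ns, and a 90 degree shift puts the capture edge 2.5 ns in, near the middle of that window.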

As you can read, I didn't use the MIG tool (doesn't work for a
Spartan3/200).

-- 
Reply to nico@nctdevpuntnl (punt=.)
Find businesses and shops at www.adresboekje.nl

Reply by John Williams ●September 7, 2006

Hi David,

David Ashley wrote:

> I have about 4 different independent things that each need to access
> a ddr.
> 
> On one hand it seems I can make them all wishbone
> compliant then just have a wishbone ddr interface.

Some off-the-shelf options that spring to mind:

Firstly there's the opb_mch_ddr interface core that comes with Xilinx EDK - in
addition to the OPB bus interface (which you can ignore), it has 4 independent
channels that support a fairly simple cacheline fetch protocol (Xilinx
CacheLink).  In reality it's just their FSL port, with a specific access protocol.

It's intended for interfacing MicroBlaze CPUs to memory, but no reason you
couldn't use it for something else.  Current versions of the core are fixed
priority on the 4 ports, but I believe that round robin and other priority
schemes are in the roadmap (if you read the VHDL sources anyway).

Xilinx also has a MPMC (multiport memory controller) that was developed for the
gigabit serial reference design (GSRD), also worth a look.  I think you need to
register to download this design.

MPMC uses something called LocalLink, again just a sort of cacheline read/write
protocol, nothing too tricky I don't think, and there should be full source
examples of how to drive it in the reference design.

Ultimately you have to arbitrate somewhere, be that on a bus, in the memory
controller, or at the DDR pin/signal stage.  But as others have suggested, that
may be more trouble than it's worth.

Regards,

John

Reply by Christian Kirschenlohr ●September 8, 2006

Hi David,
I had a similar problem where 3 users on different clocks
needed to access an SDRAM interface which ran at 166 MHz.
The main problem was that every user needed full-page bursts
in/out of the RAM.
So each of the users got its own dual-clock (dc) FIFO. A single
arbitration unit took care that all users got access at the
right time. As the SDRAM controller I used the OpenCore from Altera.
This design runs very well in Cyclone and Cyclone II devices.
Hope that helps.

Regards
Christian

Reply by KJ ●September 8, 2006

"David Ashley" <dash@nowhere.net.dont.email.me> wrote in message 
news:450072e1$3_1@x-privat.org...
> Hi,
>
> I have about 4 different independent things that each need to access
> a ddr.
Then you need a 4 port arbitrator to control access to the DDR.

> On one hand it seems I can make them all wishbone
> compliant then just have a wishbone ddr interface.
>
The arbitrator would then have 4 user ports and one DDR port, all can be 
Wishbone compliant and connect up nicely.

> Would be workable/advisable to instead just have each device
> control the ddr itself, and use the ddr's own interface directly?
>
Probably not.  Arbitration by nature needs 'global' knowledge of the scope 
of what it is arbitrating in order to be effective:  Some things needed are
- It needs to know about the 4 (or however many) users
- Preferred burst sizes for each port
- How long the other ports should wait while they're waiting for their turn 
(i.e. how important is latency).
- Arbitration scheme (round robin, etc.)

All of the above can be implemented in a single arbiter where basically all 
of the above can (and should...IMO)  be input as generic parameters and will 
be very efficient in terms of logic resources.  Without expanding the 
arbiter design beyond the design considerations that are important to your 
particular application, you can write such an arbiter and still parameterize 
it enough that it might be either directly useful the next time this 
situation comes up again or will at least provide a solid baseline for you 
to generalize it a bit more for that next application without having to 
totally re-invent the wheel.
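
As a behavioral illustration of such a parameterized arbiter, here is a minimal Python model (not synthesizable HDL). The class name, the `num_ports` and `max_burst` parameters, and the round-robin policy are all assumptions chosen for the sketch; in a real design they would be generics on the arbiter entity.

```python
class RoundRobinArbiter:
    """Behavioral model of a parameterized N-port arbiter (hypothetical)."""

    def __init__(self, num_ports=4, max_burst=8):
        self.num_ports = num_ports   # number of user ports
        self.max_burst = max_burst   # longest grant before re-arbitration
        self.last = num_ports - 1    # so that port 0 is considered first

    def grant(self, requests):
        """requests: pending transfer count per port (0 means idle).

        Returns (port, transfers_granted), or None if nothing is pending.
        """
        for offset in range(1, self.num_ports + 1):
            port = (self.last + offset) % self.num_ports
            if requests[port] > 0:
                self.last = port
                return port, min(requests[port], self.max_burst)
        return None


def drain(arbiter, pending):
    """Serve requests until every port is idle; return the grant order."""
    order = []
    while True:
        result = arbiter.grant(pending)
        if result is None:
            return order
        port, count = result
        pending[port] -= count
        order.append((port, count))


order = drain(RoundRobinArbiter(num_ports=4, max_burst=8), [20, 0, 3, 0])
print(order)  # [(0, 8), (2, 3), (0, 8), (0, 4)]
```

Capping each grant at `max_burst` is what bounds how long the other ports wait for their turn, which is the latency knob mentioned in the list above.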

Each device (i.e. user of the DDR) should not be concerned with this type of 
information; it should just think that it has exclusive access to a 
Wishbone port.  Gumming up the user design with this info would be 
counterproductive at best.  By moving bits and pieces of the arbitration 
into each user's design code you're most likely to create something that is
- Less efficient in terms of logic resource usage than it should be (at best 
it would be no worse but I doubt it could be better)
- A less than useful 'device', since it is now only applicable if it is 
used in a system with three other users all sharing a DDR.

> I'd only need one complicated mechanism to initialize the ddr
> after reset, but from then on each of the user processes can just
> request access to the ddr, and when granted just take over
> the lines.
Again, what is needed is the arbitration logic.

>
> One concern is that ddr timing at 100 mhz is pretty tight. Having
> the logic to combine 4 different sources into control signals for
> the ddr might add too much overhead. Of course it can be
> accomplished with a single LUT just doing a non-registered
> OR, if all 4 sources know to zero out their control lines when
> they're not the master...
>
The four port inputs into the arbiter can be as fast or faster since they 
will all be inside the FPGA.  DDR may be fast but FPGA internal is faster.

KJ

Reply by David Ashley ●September 8, 2006

KJ wrote:
>>Would be workable/advisable to instead just have each device
>>control the ddr itself, and use the ddr's own interface directly?
>>
> 
> Probably not.  Arbitration by nature needs 'global' knowledge of the scope 
> of what it is arbitrating in order to be effective:  Some things needed are
> - It needs to know about the 4 (or however many) users
> - Preferred burst sizes for each port
> - How long the other ports should wait while they're waiting for their turn 
> (i.e. how important is latency).
> - Arbitration scheme (round robin, etc.)
> KJ 

The arbitration is separate from the interface. Wishbone probably already
has an arbiter capability built in, I'd guess.

It's either
USER <-> Arbitrator <-> Wishbone <-> DDR
or
USER <-> Arbitrator <-> DDR

Actually the arbitration logic is not really *in* the chain, it just
selectively allows the USER to access the next stage on the other
side.

It's critical that bursts be handled well. DDR effectively has a minimum
burst of 2, and the 2 addresses are always at A and A+1. Probably A
is even also, but I don't remember at the moment.

Also a burst within DDR can't cross a row. Then there is arbitrary
CAS latency.

Wishbone supports bursting, but there probably aren't any restrictions.
That is, bursts probably don't need to start on an even address, and
they don't have to end before they cross some arbitrary boundary.

But I can easily make the 4 users of the DDR work within the DDR's
limitations. They can also take full advantage of the DDR's capabilities.

With the wishbone approach I get a generic piece of logic I can reuse
with other DDR's. But at what cost?

Complications:
1) To support bursting, it needs to have some sort of fifo. An easy way
would be the core stores up the whole burst, then transacts it to the
DDR when all is known. But to reduce latency, the DDR transaction
probably ought to start while the USER is still pushing data into the
wishbone interface. The whole goal is to get as close to 2 memory
accesses per clock, since that's what DDR supports.

2) The wishbone core must deal with page crossing bursts somehow.
This would mean breaking up a burst into 2 ddr bursts. Otherwise
if I impose address/burst restrictions on the wishbone core, it's
not 100% compliant, I'd expect.

3) The wishbone core must deal with the even/odd address limitations,
otherwise it's not 100% compliant, I expect.
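
Complications 2 and 3 can be sketched in a few lines of Python. The row size of 512 words is an assumed, hypothetical value, and the even-address rule is only what the post above suspects, not a confirmed DDR requirement.

```python
def split_burst(start, length, row_size=512):
    """Split a linear burst into pieces that each stay inside one row.

    start: first word address; length: number of words.
    row_size: words per row (512 is an assumption for illustration).
    Returns a list of (address, word_count) tuples.
    """
    pieces = []
    addr, remaining = start, length
    while remaining > 0:
        room = row_size - (addr % row_size)  # words left before the row ends
        count = min(room, remaining)
        pieces.append((addr, count))
        addr += count
        remaining -= count
    return pieces


def pad_to_pairs(start, length):
    """Widen a burst so it starts and ends on an even word address.

    Only relevant if the 2-word DDR access must begin at an even address.
    Returns (padded_start, padded_length).
    """
    first = start - (start % 2)
    end = start + length
    end += end % 2
    return first, end - first


print(split_burst(500, 32))  # [(500, 12), (512, 20)]
print(pad_to_pairs(501, 4))  # (500, 6)
```

A burst of 32 words starting at address 500 becomes two DDR bursts, and an odd-aligned 4-word access grows to 6 words, with the extra words masked on a write or discarded on a read.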

The disadvantages of involving wishbone are
1) More complicated, more work, later time to market
2) Almost certainly will introduce latency in pipeline
3) To implement, I've got to learn wishbone AND ddr,
as opposed to just ddr now and perhaps wishbone at
a later date.

The advantages are
1) Single logic driving DDR pins, so supposedly clock timing can
be met easier.
2) More general for code reuse, since lots of things already
support wishbone.

Also of note that the end result of all this is a system meant
to be released as open source. That's why if I were going to
use wishbone I'd feel compelled to do it right.

Anyway it still isn't clear to me the wishbone approach is
automatically right for this particular application.

Thanks for everyone's input on this so far.

-Dave

-- 
David Ashley                http://www.xdr.com/dash
Embedded linux, device drivers, system architecture

Reply by KJ ●September 8, 2006

David Ashley wrote:
> KJ wrote:
> >>Would be workable/advisable to instead just have each device
> >>control the ddr itself, and use the ddr's own interface directly?
> >>
> >
> > Probably not.  Arbitration by nature needs 'global' knowledge of the scope
> > of what it is arbitrating in order to be effective:  Some things needed are
> > - It needs to know about the 4 (or however many) users
> > - Preferred burst sizes for each port
> > - How long the other ports should wait while they're waiting for their turn
> > (i.e. how important is latency).
> > - Arbitration scheme (round robin, etc.)
> > KJ
>
> The arbitration is separate from the interface.
We may not be meaning the same thing when we say 'interface'.  In your
example, you have four 'users' who need to share a common resource, the
DDR.  Maybe you're seeing this as all one 'interface' but in reality it
consists of several of what I would call 'interfaces'.  One way to
approach this problem would be to use a single DDR controller core and
arbitrate access to the input.  In that scenario you would have 11
interfaces:

#1-4 are the individual master interfaces out of each of the four
'users'.
#5-8 are the individual slave interfaces that are single, individual
targets of #1-4.
#9 is a single master interface that connects to #10.
#10 is the slave side interface of a DDR controller core.
#11 is the master side interface from your DDR controller that is
intended to hook up to the actual DDR itself.

The task then would be to...
- Implement the function Arb() that performs the translation from interfaces
#5-8 to create #9.
- Connect up all of what are now point to point connections.

Now, there is some function that I'll call f() that implements whatever
is necessary to go from interface #10 to interface #11.  Presumably
this is simply the OpenCores DDR controller or some other commercial
controller.  In any case, those cores all fit the basic interface
structure that I've defined above to get from #10 to #11.  They don't
fit mapping more than one input to the DDR directly.

What I thought you were suggesting is that you take this function f()
and replicate it 4 times and then add the arbitration between the
outputs of those four f() functions before applying it to the physical
DDR and putting this code in with the four users.

You could implement it this way, but if you do I'm confident that
you'll be chewing up many more logic resources than you would if you
instead focused on creating the arbitration function Arb() which
performs the magic to connect interfaces #5-8 to interface #9.

> Wishbone probably already
> has an arbiter capability built in, I'd guess.

Guess again.  Wishbone is strictly a point to point interface.  By that
I mean that it simply defines the signals to/from the master, the
signals to/from the slave and how those signals accomplish data
transfer.  The logic for multiple masters off a slave or multiple
slaves off of a master is outside of the Wishbone specification.

What Wishbone brings to the table is a common interface.  This is handy
since by my definition of 'interface' there are 11 of them in
play...with the exception of #11, all can be Wishbone or any other
standard you want to code to.  Wishbone doesn't bring a lot of baggage
so that IF you need to have multiple masters/slaves that you don't have
cumbersome extra logic.  Altera's 'Avalon' and OpenCores 'SimpCon'
interfaces are all similar in that regard.  They are all point to point
but can be used in a multi-master/multi-target system quite easily.

My point was that, viewed in this light, the arbitration function which
connects #5-8 to #9 can be both somewhat generic (i.e. could be used to
arbitrate other devices besides DDR) and yet still be parameterized to
handle the peculiarities for your particular application (i.e. bursting
whenever possible to DDR).

<snip>
> Actually the arbitration logic is not really *in* the chain, it just
> selectively allows the USER to access the next stage on the other
> side.
Again, considering what I consider to be an 'interface' puts this
directly into the chain.  If you go the route of implementing multiple
f() functions inside the four users you still end up with the same 11
interfaces but now some of them are buried inside each of the four
users so they are only going away in the sense that you're drawing the
border line around a slightly bigger area.  Now you would have four
'users' that do not have native Wishbone interfaces but instead have a
native DDR interface that then needs to be arbitrated and translated
into the same final output interface to the DDR.

>
> It's critical that bursts be handled well. DDR effectively has a minimum
> burst of 2, and the 2 addresses are always at A and A+1. Probably A
> is even also, but I don't remember at the moment.
>
> Also a burst within DDR can't cross a row. Then there is arbitrary
> CAS latency.
>
> Wishbone supports bursting, but there probably aren't any restrictions.
> That is, bursts probably don't need to start on an even address, and
> they don't have to end before they cross some arbitrary boundary.
>
Still using the approach that I suggested, all of that can be handled
with the Arb() function as well.

> But I can easily make the 4 users of the DDR work within the DDR's
> limitations.
At the expense of now making those 4 users tuned specifically to the
nuances of DDR.  If you migrate this to some other memory technology
then you have to retune each of these four for the new nuances of that
memory.

> With the wishbone approach I get a generic piece of logic I can reuse
> with other DDR's. But at what cost?
Good question.  I can't really give details, but I'll say that I've
implemented the approach that I mentioned for interfacing six masters
to DDR and the logic resources consumed were less than but roughly
comparable to that consumed by a single DDR controller.  I had all the
same issues that you're aware of regarding how you need to properly
control DDR to get good performance and all.

The Arb() function that I implemented is also parameterized so that I
could use it to interface effectively with a PCI bridge as well without
changing any code (only the parameter settings when instantiating the
entity).

>
> Complications:
> 1) To support bursting, it needs to have some sort of fifo. An easy way
> would be the core stores up the whole burst, then transacts it to the
> DDR when all is known.
I'd suggest keeping along that train of thought as you go forward but
keep refining it.

> But to reduce latency, the DDR transaction
> probably ought to start while the USER is still pushing data into the
> wishbone interface. The whole goal is to get as close to 2 memory
> accesses per clock, since that's what DDR supports.
Here is where Wishbone lets you down a bit.  There was a discussion on
this newsgroup called 'JOP on Avalon' or something like that.  It was
primarily between myself and two others where we discussed the relative
merits of Wishbone, Avalon and SimpCon.  You might want to peruse that
a bit since with Wishbone you have to go a bit outside of the normal
spec by using what Wishbone calls 'tags' to get the full performance on
the DDR.  It's not really violating Wishbone, it's just not built into
it as cleanly as it is with Avalon or (from my limited knowledge of)
SimpCon.  The issue is that 'tags' are not required to be implemented
in any specific way but Wishbone has sort of set aside a particular way
of tagging that will help you get the full performance.

The key in any of this though is the realization that the address phase
and the data phase of any transaction are independent.  A master device
can initiate a second command on the address bus even before the first
has completed.  Even if you're not considering Altera's Avalon as a bus
for your design, their documentation of that bus and how those two
phases of a bus cycle are treated is very good and worth the read.  Pay
attention to the section regarding 'slave side read latency' and then
compare that to Wishbone.  It's good reading and may give you a
somewhat different perspective and can certainly help with this
arbitration function even if you don't implement using Avalon.
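
A back-of-the-envelope comparison shows why separating the two phases matters. The sketch assumes a slave that can accept one address per cycle and a fixed read latency; the figures of 16 reads and 4 cycles are made up for illustration.

```python
def cycles_blocking(num_reads, latency):
    """Each read waits for its data before the next address is issued."""
    return num_reads * (1 + latency)


def cycles_pipelined(num_reads, latency):
    """One address per cycle; the last data word arrives `latency` cycles
    after the last address has been issued."""
    return num_reads + latency


print(cycles_blocking(16, 4), cycles_pipelined(16, 4))  # 80 20
```

With these assumed numbers, overlapping the address and data phases cuts a 16-read sequence from 80 cycles to 20.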

>
> 2) The wishbone core must deal with page crossing bursts somehow.
Nope, that's up to the arbitrator...if it has been given the knowledge
of the concept of 'bursts' and further parameterized by 'burst sizes'
and 'address boundaries'.

> This would mean breaking up a burst into 2 ddr bursts. Otherwise
> if I impose address/burst restrictions on the wishbone core, it's
> not 100% compliant, I'd expect.
Shouldn't need to go that way though....crossing a page boundary should
at worst cause wait states on the user's master side if it is hammering
memory.  If the user is lightly touching it then it shouldn't even
cause that.

>
> The disadvantages of involving wishbone are
> 1) More complicated, more work, later time to market
> 2) Almost certainly will introduce latency in pipeline
> 3) To implement, I've got to learn wishbone AND ddr,
> as opposed to just ddr now and perhaps wishbone at
> a later date.
>
> The advantages are
> 1) Single logic driving DDR pins, so supposedly clock timing can
> be met easier.
> 2) More general for code reuse, since lots of things already
> support wishbone.
>
It will probably consume more logic resources your way which could
impact price.

>
> Also of note that the end result of all this is a system meant
> to be released as open source. That's why if I were going to
> use wishbone I'd feel compelled to do it right.
Guess I'm confused a bit.  If it needs to be 'open source' then it
would seem that standardizing on Wishbone would be a good thing and
having tuned the 'users' to a DDR interface would be less flexible.
This might be just a case of where you draw the boundary around the
'box'.  Maybe from the perspective of someone using your widget they
don't care directly about 'user1'...'user4' just that there are 4 of
them and they can all talk to DDR and whether or not you use a standard
interface to implement them is about as relevant as whether you code
your state machines in the 'two process' template or the 'one process'
template.

What I've found though is that using a logically complete handshaking
protocol does not impose much if any extraneous logic resource usage,
and using that protocol even on internal interfaces that nobody really
cares about is actually an aid in getting everything debugged and
working properly with the added benefit being that other people can more
readily understand those internal interfaces (if needed) since it
conforms to an established protocol.

>
> Anyway it still isn't clear to me the wishbone approach is
> automatically right for this particular application.
>
It may not be given your particular constraints.

> Thanks for everyone's input on this so far.
> 
You're welcome.  Good luck on your design.

KJ

Reply by jacko ●September 8, 2006

hi

in/out busses: low-master, high-master, bottleneck-chain

and you would need three instances.

cheers
