I'm designing another FSM and I've run into the problem I always have when trying to pipeline them for high-speed designs. I'll show a simple example. STATE2: begin if (condition) begin state <= STATE3; y <= a*b+c; end end The problem occurs if I need to add pipelining to meet speed requirements: in this case, the multiplier inputs, output, and adder output must be registered. So I have to end up doing this in the FSM: STATE2: begin if (condition) begin state <= STATE3; a_pipe <= a; b_pipe <= b; end end and outside the FSM I have to have concurrent FSMs or other logic: always@(posedge clk) begin m <= a_pipe * b_pipe; p <= m + c; y <= p; end The whole simplicity of the FSM is destroyed. Debugging and maintaining it becomes very difficult. The output y isn't available for three cycles, so I have to figure out what states the FSM might be in at that point and ensure that I don't use y before it's ready. Using any sort of FSM editor (like those bubble diagram schematic editors) is completely precluded. A similar problem crops up when 'condition' is a complex equation and has to be pipelined into the past. A common example if 'condition' has a comparison that doesn't meet timing: if (stuff && wide_counter==32'b3) ... If this doesn't meet timing, I can instead pipeline the condition outside the FSM: wide_counter_eq_3 <= wide_counter==32'b2; // will be 3 on next cycle and then inside the FSM: if (stuff && wide_counter_eq_3) ... but this can get messy and get screwed up if the counter doesn't always increment from 2 to 3. My question: what is the cleanest way to describe an FSM requiring pipelining? Is there some sort of tool that will let me make a nice bubble diagram but also indicate which operations need pipelined and will then generate the proper synthesizable HDL? -Kevin
Style for Highly-Pipelined State Machines
Started by ●April 29, 2008
Reply by ●April 29, 20082008-04-29
Kevin Neilson wrote:> My question: what is the cleanest way to describe an FSM requiring > pipelining?I don't know, but I use a step counter and a case of that step counter in a synchronous block. -- Mike Treseler
Reply by ●April 29, 20082008-04-29
"Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message news:fv7i38$69n6@cnn.xsj.xilinx.com...> > My question: what is the cleanest way to describe an FSM requiring > pipelining?As is usually the case...'it depends'. There are two basic approaches: 1. Add additional states to the FSM to match the latency of the newly pipelined operation. 2. Add a request/acknowledge handshake signal pair into and out of the FSM that controls movement to the next state. #1 is fairly straightforward, but as you're finding out gets to be a pain if you use this method exclusively instead of just for relatively simple cases that won't change. This method can also tend to create fairly cumbersome logic implementation as well (i.e. the state machine logic can get to become the critical timing path). For #2, with the request/acknowledge signals what you're doing is refactoring your logic and using signals to indicate when that process should wake up and when it has produced a result. Another way of looking at it is that you're 'subcontracting' or 'outsourcing' that hunk of logic. In the example you gave the multiply/add operation is the function that you're outsourcing. From the perspective of the state machine then it simply needs to tell this other hunk-o-logic when to start (i.e. request) and the hunk-o-logic in turn needs to tell the FSM when it has completed (i.e. acknowledge). This basic approach scales very well and is generally tolerant of further changes in the actual latency. Using your multiply/add example, you might at first decide to have one stage of latency (registering the inputs) and then later decide to make it three (register, inputs, multiply, and sum). The request/acknowledge signal *interface* between the hunk-o-logic and the FSM does not need to change, nor does the FSM itself (unlike approach #1). All that does need to change is you add a couple more flops of delay to generate the acknowledge. As you find other critical timing paths you'll still need to figure out exactly which sub-functions need to be segregated out for pipelining, but once you've done this, the approach is the same: simply create a request/acknowledge pair to control that sub-function and tie those signals into the state machine. Put the source code for the logic needed to generate the acknowledge physically right by the hunk-o-logic function itself and it makes it easy to maintain as well. If the hunk-o-logic is complex enough, you might consider making it it's own entity/architecture. If not, maybe put it in it's own separate process (something I do to kind of break up the actual source code text into manageable sized pieces to make it easy to see what that bit does). In any case, what you're basically doing is adding a bit of hierarchy to your design whether it is formal (i.e. separate entities) or less formal (separate clocked process). Once you've physically segregated the stuff, you can probably also see that it wouldn't be that hard to parameterize it so that you could have generics select whether the latency in some fashion...but that's a bit off topic. The other thing to consider is whether the latency being introduced by this outsourced logic needs to be 'compensated for' in some fashion or is it OK to simply wait for the acknowledge. In some instances, it is fine for the FSM to simply wait in a particular state until the acknowledge comes back. In others you need to be feeding new data into the hunk-o-logic on every clock cycle even though you haven't got the results from the first back. In that situation you still have the req/ack pair but now the ack is simply saying that the request has been accepted for processing, the actual results will be coming out later. Now the hunk-o-logic needs an additional output to flag when the output is actually valid. This output data valid signal would typically tend to feed into a separate FSM or some other logic (i.e. 'usually' not the first FSM). The first FSM controls feeding stuff in, the second FSM or other processing logic is in charge of taking up the outputs and doing something with it. When you ponder on this approach some more, you will come to the realization that this signalling back and forth between various sub-processes boils down to simply managing data transfer. Once that realization has settled in, it is worthwhile to study existing data transfer techniques (I'd suggest Wishbone and Avalon, I prefer Avalon), decide for yourself which signalling scheme to use and stick to it, using that specification's naming/logic conventions and go from there and never look back at all the other possible ad-hoc ways of doing it.> Is there some sort of tool that will let me make a nice bubble diagram but > also indicate which operations need pipelined and will then generate the > proper synthesizable HDL?Altera's Quartus has a state machine machine viewer that can be exported for documentation if that's what you're looking for. Kevin Jennings
Reply by ●May 2, 20082008-05-02
KJ wrote:> "Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message > news:fv7i38$69n6@cnn.xsj.xilinx.com... >> My question: what is the cleanest way to describe an FSM requiring >> pipelining?...> The other thing to consider is whether the latency being introduced by this > outsourced logic needs to be 'compensated for' in some fashion or is it OK > to simply wait for the acknowledge. In some instances, it is fine for the > FSM to simply wait in a particular state until the acknowledge comes back. > In others you need to be feeding new data into the hunk-o-logic on every > clock cycle even though you haven't got the results from the first back. In > that situation you still have the req/ack pair but now the ack is simply > saying that the request has been accepted for processing, the actual results > will be coming out later. Now the hunk-o-logic needs an additional output > to flag when the output is actually valid. This output data valid signal > would typically tend to feed into a separate FSM or some other logic (i.e. > 'usually' not the first FSM). The first FSM controls feeding stuff in, the > second FSM or other processing logic is in charge of taking up the outputs > and doing something with it. >...> > Kevin Jennings >In this case I do indeed have to continue to keep the pipe full, so inserting wait states is not an option. And the latency of the "hunk of logic", aka concurrent process, is actually significant because I have to get the result and feed it back into the FSM. This example shows why: STATE2: begin if (condition) begin state <= STATE3; y <= (a*b+c)*d; end end I have to get the result (a*b+c) and then feed it back into the FSM so I can multiply by d. Why not just let the concurrent process handle that? Because I want to limit my resource usage to a single DSP48, so I have to schedule the multiplications inside the FSM. But I'll have to check out the Wishbone thing you're talking about. -Kevin
Reply by ●May 2, 20082008-05-02
I think you should do is put your "piplied" . the number of state you need to add =3D=3D the number of cycle you need to use for (a*b+c)*d. that's you exactly add the operation inside the states. but not out side the FSM On May 2, 2:20=A0pm, Kevin Neilson <kevin_neil...@removethiscomcast.net> wrote:> KJ wrote: > > "Kevin Neilson" <kevin_neil...@removethiscomcast.net> wrote in message > >news:fv7i38$69n6@cnn.xsj.xilinx.com... > >> My question: =A0what is the cleanest way to describe an FSM requiring > >> pipelining? > ... > > The other thing to consider is whether the latency being introduced by t=his> > outsourced logic needs to be 'compensated for' in some fashion or is it =OK> > to simply wait for the acknowledge. =A0In some instances, it is fine for=the> > FSM to simply wait in a particular state until the acknowledge comes bac=k.> > In others you need to be feeding new data into the hunk-o-logic on every=> > clock cycle even though you haven't got the results from the first back.==A0In> > that situation you still have the req/ack pair but now the ack is simply=> > saying that the request has been accepted for processing, the actual res=ults> > will be coming out later. =A0Now the hunk-o-logic needs an additional ou=tput> > to flag when the output is actually valid. =A0This output data valid sig=nal> > would typically tend to feed into a separate FSM or some other logic (i.=e.> > 'usually' not the first FSM). =A0The first FSM controls feeding stuff in=, the> > second FSM or other processing logic is in charge of taking up the outpu=ts> > and doing something with it. > > ... > > > Kevin Jennings > > In this case I do indeed have to continue to keep the pipe full, so > inserting wait states is not an option. =A0And the latency of the "hunk of=> logic", aka concurrent process, is actually significant because I have > to get the result and feed it back into the FSM. =A0This example shows why=:> > STATE2: begin > =A0 =A0if (condition) > =A0 =A0 =A0begin > =A0 =A0 =A0 =A0state <=3D STATE3; > =A0 =A0 =A0 =A0y =A0 =A0 <=3D (a*b+c)*d; > =A0 =A0 =A0end > end > > I have to get the result (a*b+c) and then feed it back into the FSM so I > can multiply by d. =A0Why not just let the concurrent process handle that?=> =A0 Because I want to limit my resource usage to a single DSP48, so I have=> to schedule the multiplications inside the FSM. =A0But I'll have to check > out the Wishbone thing you're talking about. > -Kevin
Reply by ●May 2, 20082008-05-02
Kevin Neilson wrote:> I have to get the result (a*b+c) and then feed it back into the FSM so I > can multiply by d. Why not just let the concurrent process handle that? > Because I want to limit my resource usage to a single DSP48, so I have > to schedule the multiplications inside the FSM.I still like the idea of a step counter. On tick one, I do x <= a * b *c; On tick two, I do y <= x * d; and so on ... -- Mike Treseler
Reply by ●May 3, 20082008-05-03
"Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message news:fvfm29$on81@cnn.xsj.xilinx.com...> KJ wrote: >> "Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message >> news:fv7i38$69n6@cnn.xsj.xilinx.com... >>> My question: what is the cleanest way to describe an FSM requiring >>> pipelining? > ... >> The other thing to consider is whether the latency being introduced by >> this outsourced logic needs to be 'compensated for' in some fashion or is >> it OK to simply wait for the acknowledge. In some instances, it is fine >> for the FSM to simply wait in a particular state until the acknowledge >> comes back. In others you need to be feeding new data into the >> hunk-o-logic on every clock cycle even though you haven't got the results >> from the first back. In that situation you still have the req/ack pair >> but now the ack is simply saying that the request has been accepted for >> processing, the actual results will be coming out later. Now the >> hunk-o-logic needs an additional output to flag when the output is >> actually valid. This output data valid signal would typically tend to >> feed into a separate FSM or some other logic (i.e. 'usually' not the >> first FSM). The first FSM controls feeding stuff in, the second FSM or >> other processing logic is in charge of taking up the outputs and doing >> something with it. >> > ... >> >> Kevin Jennings > In this case I do indeed have to continue to keep the pipe full, so > inserting wait states is not an option. And the latency of the "hunk of > logic", aka concurrent process, is actually significant because I have to > get the result and feed it back into the FSM. This example shows why: > > STATE2: begin > if (condition) > begin > state <= STATE3; > y <= (a*b+c)*d; > end > end > > I have to get the result (a*b+c) and then feed it back into the FSM so I > can multiply by d. Why not just let the concurrent process handle that? > Because I want to limit my resource usage to a single DSP48, so I have to > schedule the multiplications inside the FSM. But I'll have to check out > the Wishbone thing you're talking about.Well, just the fact that you're time sharing the DSP48 means that you're not processing something new on every clock cycle which just screams out to me that you'd want to implement this with a request/acknowledge type of framework. Consider having a black box that has two logical interfaces called 'inp' and 'out'. The 'inp' interface will be written to by some external thingy and provide 'a', 'b', 'c' and 'd' inputs. The black box will compute "y <= (a*b+c)*d;" and make it available on the 'out' interface via a read command. Using Avalon naming nomenclature then this black box will have the following set of signals: inp_write: input inp_writedata: input vector inp_waitrequest: output out_read: input out_readdata: output vector out_waitrequest: output What this black box would do is provide the output of the calculation "y <= ..." on the signal 'out_readdata' in response to a read request on 'out_read'. If the calculation has not been completed (maybe 'a', 'b' and 'c' aren't even available yet), the output signal 'out_waitrequest' would be set. So from the perspective of someone trying to use this black box, they would simply set 'out_read' active and wait until 'out_waitrequest' is not active. On that one particular clock cycle, 'out_readdata' has the data. Now in order to perform the calculation the black box needs 'a', 'b', 'c' and 'd'. For simplicity, I'll assume that they all become available at the same time. At that time the external thingy talking to the black box will set 'inp_write' active and 'inp_writedata' to contain 'a', 'b', 'c' and 'd' (don't constrain yourself into thinking of this as being 'bytes' or 'words', etc.). If the black box in turn sets the 'inp_waitrequest' output to a 1 then it means that the external thingy needs to hold 'inp_write' and 'inp_writedata' without changing them because the black box, for whatever reason, is not quite ready (maybe because the results of the previous calculation have not yet been read as an example). The 'external thingy' that controls the black box, is possibly your state machine that is basically signalling the black box when it has new data to process. The interface between these two then consists of the above set of well defined signals. The state machine doesn't need to know or care explicitly about what the actual latency is in getting through the black box. In order for your system to operate properly, you may need to have some latency requirement, but the point here is that the controlling state machine can be oblivious to what that latency actually is. Now turn to the black box. Since you want to share use of the DSP48 so that it gets reused this means that there will be a state machine inside it in some fashion. After receiving new data on the 'inp' interface, the black box would set the inp_waitrequest active on the next clock cycle to prevent subsequent writes from occurring until the black box is ready to accept more data. Then you go through your calculation and a couple clock cycles later you've finally computed the requested output....now what? You've got the output needed for 'out_readdata' to send out with the result. Up until that point, 'out_waitrequest' would be active to indicate that the output is not available. But once it is, the black box would drop 'out_waitrequest'. If 'out_read' is active then the result has been passed along, if not then you need to hold on to the result until 'out_read' does get set. It all might sound complicated and that it will take up considerable logic resources to implement but in fact it doesn't. When all the dust has settled the logic resources used up in the part will be pretty much identical to any other scheme you can come up with (such as counters or extra states in a state machine). In exchange you'll get a much easier to understand and maintain design with less chances for having some logic hole that occurs under only very infrequent conditions which is very easy to get when you have a convoluted difficult to follow single state machine. Lastly, I've used the Avalon naming convention but Wishbone is logically identical, instead of a 'waitrequest' they have 'acknowledge' which are logically just the not() of one other. Avalon is better when it comes to dealing with read cycle latency, it defines a specific signal for this whereas Wishbone doesn't directly handle this at all but has some additional signals that can be used for any purpose, one of which can be to handle read cycle latency. Kevin Jennings
Reply by ●May 5, 20082008-05-05
KJ wrote:> "Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message > news:fvfm29$on81@cnn.xsj.xilinx.com... >> KJ wrote: >>> "Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message >>> news:fv7i38$69n6@cnn.xsj.xilinx.com... >>>> My question: what is the cleanest way to describe an FSM requiring >>>> pipelining? >> ... >>> The other thing to consider is whether the latency being introduced by >>> this outsourced logic needs to be 'compensated for' in some fashion or is >>> it OK to simply wait for the acknowledge. In some instances, it is fine >>> for the FSM to simply wait in a particular state until the acknowledge >>> comes back. In others you need to be feeding new data into the >>> hunk-o-logic on every clock cycle even though you haven't got the results >>> from the first back. In that situation you still have the req/ack pair >>> but now the ack is simply saying that the request has been accepted for >>> processing, the actual results will be coming out later. Now the >>> hunk-o-logic needs an additional output to flag when the output is >>> actually valid. This output data valid signal would typically tend to >>> feed into a separate FSM or some other logic (i.e. 'usually' not the >>> first FSM). The first FSM controls feeding stuff in, the second FSM or >>> other processing logic is in charge of taking up the outputs and doing >>> something with it. >>> >> ... >>> Kevin Jennings >> In this case I do indeed have to continue to keep the pipe full, so >> inserting wait states is not an option. And the latency of the "hunk of >> logic", aka concurrent process, is actually significant because I have to >> get the result and feed it back into the FSM. This example shows why: >> >> STATE2: begin >> if (condition) >> begin >> state <= STATE3; >> y <= (a*b+c)*d; >> end >> end >> >> I have to get the result (a*b+c) and then feed it back into the FSM so I >> can multiply by d. Why not just let the concurrent process handle that? >> Because I want to limit my resource usage to a single DSP48, so I have to >> schedule the multiplications inside the FSM. But I'll have to check out >> the Wishbone thing you're talking about. > > Well, just the fact that you're time sharing the DSP48 means that you're not > processing something new on every clock cycle which just screams out to me > that you'd want to implement this with a request/acknowledge type of > framework. ... > > Kevin Jennings > >But I *do* have to process something on every cycle. Consider that I have to process these two equations: y0 <= (a0*b0+c0)*d0; y1 <= (a1*b1+c1); Now, if you look at the structure of the DSP48, you can see that I can't even process these two sequentially. I can send off (a0*b0+c0)*d0 to the black box thingy you speak of, but this can't be processed without dead cycles: I have to get the result of (a0*b0+c0) before I multiply it with d0, and if the DSP48 is fully pipelined, that means that the multiplier is unused for three cycles. It's similar to a superscalar process with dependencies. So I have to reschedule: I put (a0*b0+c0) into the pipe, then put in (a1*b1+c1) (which has no dependency on what is in the pipe), and then when the result of (a0*b0+c0) pops out I can feed it back into the DSP48 and multiply it with d0 to get y0. In the meantime y1 pops out. Without this intermixed scheduling I end up with too many dead cycles and then I need to use too many DSP48s. Maybe what I really want is a way to take the superscalar scheduling techniques that have been used for a long time and apply them in a similar way to HDL. -Kevin
Reply by ●May 5, 20082008-05-05
Mike Treseler wrote:> Kevin Neilson wrote: > >> I have to get the result (a*b+c) and then feed it back into the FSM so I >> can multiply by d. Why not just let the concurrent process handle that? >> Because I want to limit my resource usage to a single DSP48, so I have >> to schedule the multiplications inside the FSM. > > I still like the idea of a step counter. > > On tick one, I do x <= a * b *c; > On tick two, I do y <= x * d; > and so on ... > > -- Mike TreselerThat is essentially what I'm doing; I'm just trying to find a syntactically better way to design this pipelined stuff without having a bunch of interdependent concurrent FSMs (or a single FSM and a bunch of logic outside). -Kevin
Reply by ●May 6, 20082008-05-06
"Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message news:fvo47b$on52@cnn.xsj.xilinx.com...> KJ wrote: >> "Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message >> news:fvfm29$on81@cnn.xsj.xilinx.com... >>> KJ wrote: >>>> "Kevin Neilson" <kevin_neilson@removethiscomcast.net> wrote in message >>>> news:fv7i38$69n6@cnn.xsj.xilinx.com... >>>>> My question: what is the cleanest way to describe an FSM requiring >> >> Well, just the fact that you're time sharing the DSP48 means that you're >> not processing something new on every clock cycle which just screams out >> to me that you'd want to implement this with a request/acknowledge type >> of framework. ... >> >> Kevin Jennings > But I *do* have to process something on every cycle.You're not able to process a new set of 'a', 'b', 'c' and 'd' on every clock cycle since the DSP48 is time shared (by your choice) and that was my point. Time multiplexing the DSP48 to keep *it* busy on every clock cycle is not the same thing.> Consider that I have to process these two equations: > > y0 <= (a0*b0+c0)*d0; > y1 <= (a1*b1+c1); > > Now, if you look at the structure of the DSP48, you can see that I can't > even process these two sequentially. I can send off (a0*b0+c0)*d0 to the > black box thingy you speak of, but this can't be processed without dead > cycles: I have to get the result of (a0*b0+c0) before I multiply it with > d0, and if the DSP48 is fully pipelined, that means that the multiplier is > unused for three cycles.That's only true if the addition can't be done combinatorially. If it can then the calculation of 'y0' takes two clock cycles and the DSP48 is fully utilized. The answer pops out after two clock cycles of latency, the DSP48 hums along doing something useful on every tick.> It's similar to a superscalar process with dependencies. So I have to > reschedule: I put (a0*b0+c0) into the pipe, then put in (a1*b1+c1) (which > has no dependency on what is in the pipe), and then when the result of > (a0*b0+c0) pops out I can feed it back into the DSP48 and multiply it with > d0 to get y0. In the meantime y1 pops out. Without this intermixed > scheduling I end up with too many dead cycles and then I need to use too > many DSP48s. >And depending on just what the bottlenecks in the design are, one can do all kinds of things. But no matter what, you still need to interface *to* that thing, no matter what it does and no matter how wide of an input vector it takes (i.e. a0, b0, c0, d0, a1, b1, c1...if that's what it takes). In other words, a0, b0, c0, d0, a1, b1 and c1 all need to get in somehow; y0 and y1 both need to make it out and you need to flag when they are valid and that flagging is functionally the same thing as handshaking. Kevin Jennings





