FPGARelated.com
Forums

tricks to make large PLAs fast?

Started by Unknown February 27, 2006
I have a design with a large PLA, and I'm trying to make it run fast in
a Spartan-3.  It's too big for block RAM, since it has 25 inputs,
slightly fewer than 512 product terms, and 32 outputs.

My first attempt was to just translate the PLA equations to VHDL and
synthesize it with the default settings.  This uses 34% of a 3S500E, and
has a minimum cycle time of slightly under 12 ns with over 20 levels of
logic.  If I turned on timing-directed mapping, or increased the effort
levels, I suppose it might get slightly better.  But my own analysis
suggests that it should be possible to implement the PLA with no more
than 11 levels of logic worst case, seven levels for the product terms,
and 4 levels for the sums.

Anyhow, as the subject line suggests, I'm interested in any clever
tricks to get more efficient FPGA implementation of the PLA.  For
instance, would use of the carry chain speed up wide gates?  (Or do
the tools already infer that?)  Should I put some constraint on the
product terms, to keep the tools from merging the product term and
sum logic?

I'll experiment with this myself, but perhaps someone here has already
done this, in which case suggestions would be quite welcome.

Thanks,
Eric
Eric

First comment is that you have a lot of levels of logic. Might not be a 
problem if only carry logic.

Is the 12nS a result of the a build with a timing constraint set or run 
without?

Often you can assist speed by seperating logic manually into logical blocks 
and not allowing the synthesis process to share logic. Sometimes you can 
enforce a function crudly by using the negative edge to produce a sub-term 
but that does not always work well.

If you set the frequency constraint to what you want the tools will usually 
give a failure if there is no chance of making timing. The thing then is 
examine where you fail and see if you can pipeline or in some way make it 
faster. One thing to remember in SRAM based FPGAs is the the width, or 
number of inputs, to a function have a direct significance on the levels of 
logic and hence speed. So as an example if you have a state machine driving 
this logic consider using grey or binary encoding rather than the usual 
default of one-hot if it is such an input.

John Adair
Enterpoint Ltd. - Home of Hollybush1. The PC104+ FPGA Development Board.
http://www.enterpoint.co.uk

"Eric Smith" <eric@brouhaha.com> wrote in message 
news:qh4q2lcjgi.fsf@ruckus.brouhaha.com...
>I have a design with a large PLA, and I'm trying to make it run fast in > a Spartan-3. It's too big for block RAM, since it has 25 inputs, > slightly fewer than 512 product terms, and 32 outputs. > > My first attempt was to just translate the PLA equations to VHDL and > synthesize it with the default settings. This uses 34% of a 3S500E, and > has a minimum cycle time of slightly under 12 ns with over 20 levels of > logic. If I turned on timing-directed mapping, or increased the effort > levels, I suppose it might get slightly better. But my own analysis > suggests that it should be possible to implement the PLA with no more > than 11 levels of logic worst case, seven levels for the product terms, > and 4 levels for the sums. > > Anyhow, as the subject line suggests, I'm interested in any clever > tricks to get more efficient FPGA implementation of the PLA. For > instance, would use of the carry chain speed up wide gates? (Or do > the tools already infer that?) Should I put some constraint on the > product terms, to keep the tools from merging the product term and > sum logic? > > I'll experiment with this myself, but perhaps someone here has already > done this, in which case suggestions would be quite welcome. > > Thanks, > Eric
On 27 Feb 2006 00:08:29 -0800, Eric Smith <eric@brouhaha.com> wrote:

>I have a design with a large PLA, and I'm trying to make it run fast in >a Spartan-3. It's too big for block RAM, since it has 25 inputs, >slightly fewer than 512 product terms, and 32 outputs. > >My first attempt was to just translate the PLA equations to VHDL and >synthesize it with the default settings. This uses 34% of a 3S500E, and >has a minimum cycle time of slightly under 12 ns with over 20 levels of >logic. If I turned on timing-directed mapping, or increased the effort >levels, I suppose it might get slightly better. But my own analysis >suggests that it should be possible to implement the PLA with no more >than 11 levels of logic worst case, seven levels for the product terms, >and 4 levels for the sums.
Have you tried simply comparing "optimize for area" versus "optimize for delay" settings in the synthesis tool? That can make a big difference, at least with Leonardo Spectrum; you don't say which synth tool you are using. - Brian
"Eric Smith" <eric@brouhaha.com> wrote in message 
news:qh4q2lcjgi.fsf@ruckus.brouhaha.com...
>I have a design with a large PLA, and I'm trying to make it run fast in > a Spartan-3. It's too big for block RAM, since it has 25 inputs, > slightly fewer than 512 product terms, and 32 outputs. > > My first attempt was to just translate the PLA equations to VHDL and > synthesize it with the default settings. This uses 34% of a 3S500E, and > has a minimum cycle time of slightly under 12 ns with over 20 levels of > logic. If I turned on timing-directed mapping, or increased the effort > levels, I suppose it might get slightly better. But my own analysis > suggests that it should be possible to implement the PLA with no more > than 11 levels of logic worst case, seven levels for the product terms, > and 4 levels for the sums. > > Anyhow, as the subject line suggests, I'm interested in any clever > tricks to get more efficient FPGA implementation of the PLA. For > instance, would use of the carry chain speed up wide gates? (Or do > the tools already infer that?) Should I put some constraint on the > product terms, to keep the tools from merging the product term and > sum logic? > > I'll experiment with this myself, but perhaps someone here has already > done this, in which case suggestions would be quite welcome. > > Thanks, > Eric
If your terms are wide, the carry chains are often a more compact way of doing product terms *and* sums. While short carry chains often tend to take as long as about 3 logic levels, you should be able to perform the entire 25 input, 512 product term, 32 output in 512+32 carry chains or fewer (think of the nand/nand array for implementing sum of products). The number of inputs for each product term would determine the number of elements in the carry chain. I don't believe many tools do a good job of pushing logic into carry chains. If you chose to parameterize a module to implement the product terms, you could have a threshold of 7 product terms to implement as combinatorial logic versus a manual carry chain implementation to keep a balance of delays and resources. The drawback to this technique is that many common product term "chunks" which could be shared between different terms would be in separate carry chains.
John Adair wrote:
> Is the 12nS a result of the a build with a timing constraint set or run > without?
I put registers on the inputs and outputs, and set a timing constraint for the cycle time at 8 ns. Last night I tried synthesizing the same source for a Virtex 4 LX in the -12 speed grade, with timing-based mapping turned on and the map and par efforts set to high, and a 5 ns constraint. It got it down to 10 levels of logic, but still took a hair under 10 ns. I'll try this again on the Spartan 3E. It takes a very long time to run that, so most of the time during development I'll use the default settings and live with a lower clock speed. Eric
Brian Drummond wrote:
> Have you tried simply comparing "optimize for area" versus "optimize for > delay" settings in the synthesis tool?
I think I was using optimize for delay, but I didn't save the report files so I'll have to run it again to be sure.
> That can make a big difference, > at least with Leonardo Spectrum; you don't say which synth tool you are > using.
I'm just using the ISE synthesis. Thanks, Eric
Eric Smith wrote:
> I have a design with a large PLA, and I'm trying to make it run fast in > a Spartan-3. It's too big for block RAM, since it has 25 inputs, > slightly fewer than 512 product terms, and 32 outputs.
That's a large array - does it really cover 2^25 combinations, or can you compress the inputs, so that the remainder can fit into Block Ram(s) ? -jg
Jim Granville <no.spam@designtools.co.nz> writes:
> That's a large array - does it really cover 2^25 combinations, > or can you compress the inputs, so that the remainder can fit into > Block Ram(s) ?
Not really. It was a design originally implemented in custom CMOS in the early 1980s, and I don't really want to redesign it any more than necessary. There are lots of don't cares scattered throughout the AND matrix of the PLA, so it won't fit in any reasonable-sized ROM or RAM. Also, the 25-bit input words don't uniquely map to outputs; a given input word may (and often does) match multiple product terms. I've now tried putting a "keep" attribute on the product terms, and that made the timing worse. I thought it would result in better (separate) optimization of the product terms and OR terms, rather than mashing them together and trying to optimize the result. By default, ISE *is* using the carry chain, and in fact it seems to be using it for some 95-input gates, which end up being much slower than an equivalent tree would be. I'm doing a run now with the "USE_CARRY_CHAIN" attribute set to "no" for all the sum terms. I might try having hacking my Python tool that translates the PLA so that it directly instantiates trees of four-input gates for all of the product terms and sum terms, with a "keep" attribute on each gate output, and see what kind of timing that gets me. I think that should result in the fewest possible levels of logic, which would be around eight. For the first attempt, I'll just brute-force it, so that none of the terms have any shared sub-terms. If the results look reasonable, I'll try to optimize it for use of common sub-terms to reduce the total number of LUTs, while keeping the levels constant. Eric
Eric Smith wrote:
> Jim Granville <no.spam@designtools.co.nz> writes: > >>That's a large array - does it really cover 2^25 combinations, >>or can you compress the inputs, so that the remainder can fit into >>Block Ram(s) ? > > > Not really. It was a design originally implemented in custom CMOS in > the early 1980s, and I don't really want to redesign it any more than > necessary. There are lots of don't cares scattered throughout the > AND matrix of the PLA, so it won't fit in any reasonable-sized ROM > or RAM. Also, the 25-bit input words don't uniquely map to outputs; > a given input word may (and often does) match multiple product terms.
... another idea : you could target a CPLD, like XC2C512, and see what that tool flow makes of it. It may give some ideas... Their PLA has a fanin of 40, a depth of 49 per 16 MC block, so it _should_ be very efficent - but it may be too big for the tools to optimize properly, in which case some middle nodes might help ? -jg
Jim Granville <no.spam@designtools.co.nz> writes:
> ... another idea : you could target a CPLD, like XC2C512, and see what > that tool flow makes of it. It may give some ideas... > > Their PLA has a fanin of 40, a depth of 49 per 16 MC block, so it > _should_ be very efficent - but it may be too big for the tools > to optimize properly, in which case some middle nodes might help ?
Interesting idea, but the rest of the design isn't going to fit in a CPLD, and splitting it across two chips is going to be even slower. I'm just trying tricks to improve performance in the FPGA. Eric