Hi,
currently I am designing (as an amateur project) a 32-bit stack-oriented
CPU with two stack pointers (data stack/return stack) and some
additional registers, which are partly purely auxiliary and partly dedicated
to the intended purpose of the CPU as a specialized Lisp processor.
The control is microcoded and the greater part of the microcode is
already written and successfully tested (in simulation with Icarus).
Still missing at the moment are parts of the ALU functions and the complete
interrupt/exception logic.
Nevertheless the design (done in Verilog), when synthesized, already
occupies about 1100 slices in a Spartan 3 FPGA, which I feel is a bit
heavy for what seems to me a very simple design.
Below I give the output of the Xilinx ISE WebPACK synthesis tool:

Logic Utilization                                   Used  Available  Utilization  Note(s)
  Number of Slice Flip Flops                         621      3,840          16%
  Number of 4 input LUTs                           2,561      3,840          66%
Logic Distribution
  Number of occupied Slices                        1,517      1,920          79%
  Number of Slices containing only related logic   1,517      1,517         100%
  Number of Slices containing unrelated logic          0      1,517           0%
  Total Number 4 input LUTs                        2,751      3,840          71%

(About 400 to 500 slices can be subtracted from the above figures, as they
result from accompanying structures like the VGA driver and the like.)
What catches my eye is how low the utilization of slice flip-flops is
compared to the utilization of slices. Can this be an expression of
the fact that there is much combinatorial logic (adders, multiplexers)
and, relative to that, few registers/state elements? Are adders in
particular, which I used quite generously to speed up the instructions, a
source of slice consumption? Or are multiplexers with many alternative
inputs more likely the culprits?
I would be very happy if someone with more experience than me (being
just a hobbyist) could look at the Verilog source of the CPU and give
me some hints on how to lower the amount of resources needed by
the design.
Greetings,
Jürgen
--
Jürgen Böhm www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream?" R. Thom
CPU design uses too many slices
Started by ●November 27, 2007
Reply by ●November 27, 2007
Jürgen Böhm wrote:
> What catches my eye is how low the utilization of slice flip-flops is
> compared to the utilization of slices. Can this be an expression of
> the fact that there is much combinatorial logic (adders, multiplexers)
> and, relative to that, few registers/state elements?

Yes, precisely.

> Are adders in
> particular, which I used quite generously to speed up the instructions, a
> source of slice consumption? Or are multiplexers with many alternative
> inputs more likely the culprits?

Yes, wide adders use a lot of LUTs. Multiplexers use up LUTs too. A single 4-input LUT could form a single bit of a 2-input mux, wasting one input. If you need more inputs, then you have to combine several LUTs to perform one bit's worth of multiplexer.

Xilinx has pretty detailed info on the basic structure of their chips, and you should be able to see how one would form basic logic functions out of that. It may be that Virtex would give more resources for this particular task than Spartan.

Jon
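
To make the "one LUT per bit of a 2-input mux" remark concrete, here is a minimal Python sketch that models a 4-input LUT as a 16-entry truth table and programs it as a 2:1 multiplexer. The pin assignment (I0 = a, I1 = b, I2 = sel, I3 unused) and the helper names are assumptions made for this illustration, not anything taken from the design under discussion.

```python
# Model a 4-input LUT as a 16-bit truth table (bit idx holds the output for
# input combination idx) and program it as a 2:1 multiplexer.
# Pin assignment is hypothetical: I0 = a, I1 = b, I2 = sel, I3 = unused.

def lut4_init(func):
    """Return the 16-bit truth table of func(i0, i1, i2, i3)."""
    init = 0
    for idx in range(16):
        i0 = (idx >> 0) & 1
        i1 = (idx >> 1) & 1
        i2 = (idx >> 2) & 1
        i3 = (idx >> 3) & 1
        if func(i0, i1, i2, i3):
            init |= 1 << idx
    return init

def lut4_eval(init, i0, i1, i2, i3):
    """Look up the output bit for one input combination."""
    idx = i0 | (i1 << 1) | (i2 << 2) | (i3 << 3)
    return (init >> idx) & 1

def mux2(a, b, sel, _unused):
    """2:1 mux: pass a when sel is 0, b when sel is 1; fourth input ignored."""
    return b if sel else a

MUX2_INIT = lut4_init(mux2)
print(f"INIT = 0x{MUX2_INIT:04X}")  # prints "INIT = 0xCACA"
```

The two identical bytes in the table show the wasted input: the output is the same whether I3 is 0 or 1, so a quarter of the LUT's input capacity does nothing for a plain 2:1 mux.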
Reply by ●November 27, 2007
On Nov 27, 3:57 pm, Jon Elson <el...@wustl.edu> wrote:
> Jürgen Böhm wrote:
> > What catches my eye is how low the utilization of slice flip-flops is
> > compared to the utilization of slices. Can this be an expression of
> > the fact that there is much combinatorial logic (adders, multiplexers)
> > and, relative to that, few registers/state elements?
>
> Yes, precisely.
>
> > Are adders in
> > particular, which I used quite generously to speed up the instructions, a
> > source of slice consumption? Or are multiplexers with many alternative
> > inputs more likely the culprits?
>
> Yes, wide adders use a lot of LUTs. Multiplexers use up LUTs too. A
> single 4-input LUT could form a single bit of a 2-input mux, wasting one
> input. If you need more inputs, then you have to combine several LUTs
> to perform one bit's worth of multiplexer.

Another point to make is that unless you change some defaults, the mapper will not pack slices to capacity until the whole part becomes mostly full. So the number of occupied slices does not necessarily represent the most compact placement of your design. The statistics for LUTs and flip-flops are more useful for determining your actual logic usage. However, given that your number of slices is not a whole lot more than half the number of LUTs, I'd say that further packing of "unrelated logic" won't make your design much smaller.

> Xilinx has pretty detailed info on the basic structure of their
> chips, and you should be able to see how one would form basic logic
> functions out of that. It may be that Virtex would give more resources
> for this particular task than Spartan.
>
> Jon

To benefit from changing families, you probably need to go to Virtex 5, which has 6-input LUTs. Other Virtex families look very similar to Spartan 3 from the viewpoint of the fabric.
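
The packing argument above can be checked with a few lines of arithmetic. The sketch below uses only the figures from the report in the original post; the two-LUTs-per-slice figure follows from the report itself (3,840 LUTs in 1,920 slices), and the variable names are just illustrative.

```python
# Figures copied from the utilization report in the original post.
SLICE_FFS = 621
TOTAL_LUTS = 2751
OCCUPIED_SLICES = 1517
LUTS_PER_SLICE = 3840 // 1920   # 2, from the "Available" column

# Fewest slices that could hold the LUTs if every slice were packed full.
min_slices = -(-TOTAL_LUTS // LUTS_PER_SLICE)   # ceiling division
headroom = OCCUPIED_SLICES - min_slices
ff_per_lut = SLICE_FFS / TOTAL_LUTS

print("minimum slices for the LUTs:", min_slices)         # 1376
print("slices perfect packing could save:", headroom)    # 141
print("flip-flops per LUT:", round(ff_per_lut, 2))        # 0.23
```

Even ideal packing would recover only about 141 of the 1,517 occupied slices, which is why the LUT count, not the slice count, is the number to attack. The ratio of roughly one flip-flop for every four LUTs also supports the reading that the design is dominated by combinatorial logic.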
Reply by ●November 27, 2007
Jürgen Böhm wrote:
> Hi,
>
> currently I am designing (as an amateur project) a 32-bit stack-oriented
> CPU with two stack pointers (data stack/return stack) and some
> additional registers, which are partly purely auxiliary and partly dedicated
> to the intended purpose of the CPU as a specialized Lisp processor.
> The control is microcoded and the greater part of the microcode is
> already written and successfully tested (in simulation with Icarus).
> Still missing at the moment are parts of the ALU functions and the complete
> interrupt/exception logic.
> Nevertheless the design (done in Verilog), when synthesized, already
> occupies about 1100 slices in a Spartan 3 FPGA, which I feel is a bit
> heavy for what seems to me a very simple design.
>
> Below I give the output of the Xilinx ISE WebPACK synthesis tool:
>
> Logic Utilization                                   Used  Available  Utilization  Note(s)
>   Number of Slice Flip Flops                         621      3,840          16%
>   Number of 4 input LUTs                           2,561      3,840          66%
> Logic Distribution
>   Number of occupied Slices                        1,517      1,920          79%
>   Number of Slices containing only related logic   1,517      1,517         100%
>   Number of Slices containing unrelated logic          0      1,517           0%
>   Total Number 4 input LUTs                        2,751      3,840          71%
>
> (About 400 to 500 slices can be subtracted from the above figures, as they
> result from accompanying structures like the VGA driver and the like.)
>
> What catches my eye is how low the utilization of slice flip-flops is
> compared to the utilization of slices. Can this be an expression of
> the fact that there is much combinatorial logic (adders, multiplexers)
> and, relative to that, few registers/state elements? Are adders in
> particular, which I used quite generously to speed up the instructions, a
> source of slice consumption? Or are multiplexers with many alternative
> inputs more likely the culprits?
>
> I would be very happy if someone with more experience than me (being
> just a hobbyist) could look at the Verilog source of the CPU and give
> me some hints on how to lower the amount of resources needed by
> the design.

You could download the Lattice Mico32 and reality-check against that, as it is open source.

Most FPGAs these days have multiport RAM, so it makes sense to optimise your architecture to use that: in your case for registers, and maybe even for microcode storage.

-jg
Reply by ●November 27, 2007
Jim Granville wrote:
>
> You could download the Lattice Mico32 and reality-check against that,
> as it is open source.
> Most FPGAs these days have multiport RAM, so it makes sense to optimise
> your architecture to use that: in your case for registers, and maybe
> even for microcode storage.
>

Thank you for your answer.

Indeed I use RAMB16_S36 for microcode storage; the final design will probably need four of them, as the microcode is more than 36 bits wide.

The idea from the other posters to change to Virtex FPGAs is currently not an option for me, as I really want to develop for the cheaper Spartan platform, for which a lot of affordable boards are offered - if necessary I will buy a board with the next larger Spartan 3 on it.

Greetings,
Jürgen
--
Jürgen Böhm www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream?" R. Thom
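
As a rough check on the "four of them" estimate, the following Python sketch counts how many RAMB16_S36 primitives a microcode store would need. It assumes the usual RAMB16_S36 organization of 512 words by 36 bits (32 data bits plus 4 parity bits used as ordinary storage). The microcode widths and depths in the examples are hypothetical, since the thread only says the word is wider than 36 bits.

```python
import math

# Assumed RAMB16_S36 organization: 512 words x 36 bits (32 data + 4 parity).
RAMB16_S36_WIDTH = 36
RAMB16_S36_DEPTH = 512

def block_rams_needed(word_bits, words):
    """Block RAMs needed when tiling side by side (width) and stacking (depth)."""
    columns = math.ceil(word_bits / RAMB16_S36_WIDTH)
    rows = math.ceil(words / RAMB16_S36_DEPTH)
    return columns * rows

# Hypothetical microcode sizes, for illustration only.
print(block_rams_needed(120, 512))   # 4
print(block_rams_needed(144, 512))   # 4
print(block_rams_needed(145, 512))   # 5
print(block_rams_needed(120, 1024))  # 8
```

Under these assumptions, four block RAMs side by side cover any microcode word from 109 to 144 bits, as long as the program fits in 512 words.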
Reply by ●November 28, 2007
Jon Elson wrote:
>
> Yes, wide adders use a lot of LUTs. Multiplexers use up LUTs too. A
> single 4-input LUT could form a single bit of a 2-input mux, wasting one
> input. If you need more inputs, then you have to combine several LUTs
> to perform one bit's worth of multiplexer.
>

Currently I have mainly three (5-bit select) x (32-bit data) muxes, with 16 of the alternatives actually used (I overdimensioned the muxes, as I did not know exactly, before having written the microcode, how many inputs would be necessary). Are these muxes realized by cascaded LUTs, and does your remark above imply that a 5-stage-deep chain of LUTs (1 stage for every select bit) will be used?

Greetings,
Jürgen
--
Jürgen Böhm www.aviduratas.de
"At a time when so many scholars in the world are calculating, is it not
desirable that some, who can, dream?" R. Thom
Reply by ●November 28, 2007
Jürgen Böhm wrote:
> Hi,
>
> currently I am designing (as an amateur project) a 32-bit stack-oriented
> CPU with two stack pointers (data stack/return stack) and some
> additional registers, which are partly purely auxiliary and partly dedicated
> to the intended purpose of the CPU as a specialized Lisp processor.
> The control is microcoded and the greater part of the microcode is
> already written and successfully tested (in simulation with Icarus).
> Still missing at the moment are parts of the ALU functions and the complete
> interrupt/exception logic.
> Nevertheless the design (done in Verilog), when synthesized, already
> occupies about 1100 slices in a Spartan 3 FPGA, which I feel is a bit
> heavy for what seems to me a very simple design.

[snip]

The synthesized results are really the worst-case scenario. Before worrying about a design, take it through mapping; that's where most of the logic optimization and signal trimming happens. We have designs that are over 100% utilized after synthesis that fit just fine after mapping.

---
Joe Samson
Pixel Velocity
Reply by ●November 28, 2007
On Nov 27, 10:41 pm, Jürgen Böhm <jbo...@gmx.net> wrote:
> Jim Granville wrote:
>
> > You could download the Lattice Mico32 and reality-check against that,
> > as it is open source.
> > Most FPGAs these days have multiport RAM, so it makes sense to optimise
> > your architecture to use that: in your case for registers, and maybe
> > even for microcode storage.
>
> Thank you for your answer.
> Indeed I use RAMB16_S36 for microcode storage; the final design will
> probably need four of them, as the microcode is more than 36 bits wide.
> The idea from the other posters to change to Virtex FPGAs is currently
> not an option for me, as I really want to develop for the cheaper
> Spartan platform, for which a lot of affordable boards are offered - if
> necessary I will buy a board with the next larger Spartan 3 on it.

If you are trying to fit a given device, then you need to run the full map and place portions of the tools as well. Only then will you know for sure whether your design fits. But what part is on your board? You are using about 75% of the available resources. I can't say for sure about your design, but ALU logic can be very light if designed properly, so the rest of your design may fit easily in the part.

I designed my own 16-bit CPU to have minimal size, and it was about 500 LUTs, IIRC. As in your design, most of the logic was in muxes, so I kept them as small as possible, even to the point of eliminating some instructions. Having an extra, unused select line makes them twice as large.

BTW, any unused inputs will be optimized out by the tools. So if you don't connect the select input or data inputs, that logic will not be generated.
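
The "twice as large" remark can be put into rough numbers for the three 32-bit muxes mentioned earlier in the thread. This Python sketch assumes one LUT per pair of data inputs per bit, with the slice's dedicated multiplexers (MUXF5 through MUXF8 in Xilinx's Spartan-3 documentation) combining the LUT outputs for muxes up to 32:1. It is a back-of-the-envelope estimate with a hypothetical helper name; real synthesis results will differ.

```python
import math

def mux_luts(inputs, width, count=1):
    """Rough LUT estimate for `count` muxes of `inputs`:1, each `width` bits.

    Assumption: each LUT selects between two data inputs, and dedicated
    slice multiplexers merge the LUT outputs, so no extra LUT levels are
    needed for muxes up to 32:1.
    """
    if not 2 <= inputs <= 32:
        raise ValueError("estimate only covers 2:1 up to 32:1 muxes")
    luts_per_bit = math.ceil(inputs / 2)
    return luts_per_bit * width * count

as_declared = mux_luts(32, 32, count=3)   # 5-bit select, all inputs present
as_used = mux_luts(16, 32, count=3)       # only 16 alternatives connected

print(as_declared)  # 1536
print(as_used)      # 768
```

Under these assumptions, trimming the muxes from 32 to 16 inputs halves their cost, and even the trimmed version accounts for a substantial part of the 2,751 LUTs reported for the whole design.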
Reply by ●November 28, 2007
Jürgen Böhm wrote:
> Jon Elson wrote:
>
> > Yes, wide adders use a lot of LUTs. Multiplexers use up LUTs too. A
> > single 4-input LUT could form a single bit of a 2-input mux, wasting one
> > input. If you need more inputs, then you have to combine several LUTs
> > to perform one bit's worth of multiplexer.
>
> Currently I have mainly three (5-bit select) x (32-bit data) muxes,
> with 16 of the alternatives actually used (I overdimensioned the
> muxes, as I did not know exactly, before having written the microcode,
> how many inputs would be necessary). Are these muxes realized by
> cascaded LUTs, and does your remark above imply that a 5-stage-deep
> chain of LUTs (1 stage for every select bit) will be used?

I think it probably does a little better than that. Really, it breaks it down into basic boolean equations, and then minimizes them. So it may make much more efficient use than what you describe above, and it probably gets better the more inputs you have. I think three LUTs can do a 4-input MUX; you can almost do it with two but are one input short.

If you had 5 separate select inputs (as if you were originally designing for 5 tri-state drivers on a bus), that might be less efficient than using a 3-bit binary address for the MUX. But if a binary address is decoded somewhere in your logic to the 5 select lines, that will all fall out in the logic minimization.

Jon
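
The three-LUT construction of a 4-input mux can be checked exhaustively. The Python sketch below models each LUT as a 2:1 mux (three of its four inputs used) and compares the two-level tree against a direct 4:1 selection for all 64 input combinations. The function names are illustrative only. As a hedged side note, Xilinx's Spartan-3 documentation describes a dedicated MUXF5 in each slice that can take the place of the third LUT, so the second level need not cost a LUT at all.

```python
from itertools import product

def mux2(a, b, sel):
    """One LUT's worth of logic: a 2:1 mux using three of four LUT inputs."""
    return b if sel else a

def mux4_from_luts(d, s0, s1):
    """4:1 mux built as a two-level tree of three 2:1 muxes."""
    low = mux2(d[0], d[1], s0)    # LUT 1: picks between inputs 0 and 1
    high = mux2(d[2], d[3], s0)   # LUT 2: picks between inputs 2 and 3
    return mux2(low, high, s1)    # LUT 3 (or a dedicated slice mux)

def mux4_reference(d, s0, s1):
    """Direct selection by the 2-bit binary address {s1, s0}."""
    return d[(s1 << 1) | s0]

# Exhaustive check over 4 data bits and 2 select bits (64 combinations).
mismatches = sum(
    1
    for bits in product((0, 1), repeat=6)
    if mux4_from_luts(bits[:4], bits[4], bits[5])
    != mux4_reference(bits[:4], bits[4], bits[5])
)
print("mismatches:", mismatches)  # mismatches: 0
```

The same tree idea extends to wider muxes: a 32:1 mux built purely from 2:1 LUT stages would be five levels deep, which is the chain asked about above, while logic minimization and dedicated slice muxes can shorten the LUT portion of that path.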
Reply by ●November 28, 2007
Jürgen Böhm wrote:
> Indeed I use RAMB16_S36 for microcode storage; the final design will
> probably need four of them, as the microcode is more than 36 bits wide.
> The idea from the other posters to change to Virtex FPGAs is currently
> not an option for me, as I really want to develop for the cheaper
> Spartan platform, for which a lot of affordable boards are offered - if
> necessary I will buy a board with the next larger Spartan 3 on it.

Yup, the low-cost Spartan was my choice for some designs, too, as I really had no need for the special structures that the Virtex features.

Jon





