comp.arch.fpga | Zero Address CPU logic

On 8/29/2016 7:55 PM, rickman wrote:
 > On 8/29/2016 4:30 PM, Cecil Bayona wrote:
 >> Nothing Fancy, that is why in my earlier post I mentioned that I don't
 >> have a lot of experience. I been working  a 32 bit stack based CPU, but
 >> it's a work in progress, I'm still sorting it out, it taken less than
 >> 20% of the chip, but a stack CPU are rather simple compared to other
 >> CPU's, when finished it should be pretty nice, most instructions take
 >> one clock to execute, and it used packed instructions, 5 instructions to
 >> a word fetch. Originally it was on a Lattice Brevia2, I am now
 >> converting it to a Artix-7 board, but there is software involved too so
 >> it's going slow and I'm learning as I go.
 >
 > Just a comment on your stack processor.  I've done some design work with
 > stack processors and read about a lot of designs.  In my humble opinion,
 > if you have multiple cycle instructions, you are doing it wrong.  I
 > don't want to steal the thread.  If you care to discuss this we can
 > start another thread.
 >
I'm not sure why you think that it uses  multiple cock instruction, I 
mentioned that most occur in one clock, the exception is load immediate 
it takes the instruction fetch, then a second fetch for the 32 bit value 
to push on the stack, all others take one clock. Even that one can take 
place in one clock with extra hardware to fetch RAM with a buffer to 
hold two 32 bit words, an alternative is to use two clocks one to 
execute the other to fetch program instructions.

What is does have is multiple instructions on one program word, it's 
five instructions or one depending on what it is, jump, and call take 
the whole 32 bit word and is not packed, everything else is 5 
instructions to a 32 bit program word so you have fewer memory fetches.

-- 
Cecil - k5nwa

Reply by rickman ●August 30, 20162016-08-30

On 8/30/2016 12:03 AM, Cecil Bayona wrote:
> On 8/29/2016 7:55 PM, rickman wrote:
>> On 8/29/2016 4:30 PM, Cecil Bayona wrote:
>>> Nothing Fancy, that is why in my earlier post I mentioned that I don't
>>> have a lot of experience. I been working  a 32 bit stack based CPU, but
>>> it's a work in progress, I'm still sorting it out, it taken less than
>>> 20% of the chip, but a stack CPU are rather simple compared to other
>>> CPU's, when finished it should be pretty nice, most instructions take
>>> one clock to execute, and it used packed instructions, 5 instructions to
>>> a word fetch. Originally it was on a Lattice Brevia2, I am now
>>> converting it to a Artix-7 board, but there is software involved too so
>>> it's going slow and I'm learning as I go.
>>
>> Just a comment on your stack processor.  I've done some design work with
>> stack processors and read about a lot of designs.  In my humble opinion,
>> if you have multiple cycle instructions, you are doing it wrong.  I
>> don't want to steal the thread.  If you care to discuss this we can
>> start another thread.
>>
> I'm not sure why you think that it uses  multiple cock instruction, I
> mentioned that most occur in one clock, the exception is load immediate
> it takes the instruction fetch, then a second fetch for the 32 bit value
> to push on the stack, all others take one clock. Even that one can take
> place in one clock with extra hardware to fetch RAM with a buffer to
> hold two 32 bit words, an alternative is to use two clocks one to
> execute the other to fetch program instructions.
>
> What is does have is multiple instructions on one program word, it's
> five instructions or one depending on what it is, jump, and call take
> the whole 32 bit word and is not packed, everything else is 5
> instructions to a 32 bit program word so you have fewer memory fetches.

Are you using external memory?  The CPUs I've designed were 100% 
internal to the FPGA so there was no advantage to fetching multiple 
instructions.  In fact, the multiplexer needed was just more delay.

I thought you design was not one clock per instruction because of what 
you said.  Your design is using a single memory interface.  It is common 
for stack machines to separate data and instruction space.  But if you 
are working out of external memory that is not so easy.

I assume you have read the various similar work that has been done?  The 
J1 is very interesting in that it was so durn simple.  A true MISC 
design and effective.  If you meander over to comp.lang.forth there are 
folks there who have designed some MISC CPUs which have proven their 
worth.  Bernd Paysan designed the B16 which seems like an effective 
design.  I've never tried to work with it, but it sounds similar to 
yours in they way it combines multiple instructions into a single 
instruction word.

-- 

Rick C

Reply by Cecil Bayona ●August 30, 20162016-08-30

On 8/30/2016 1:11 AM, rickman wrote:
> On 8/30/2016 12:03 AM, Cecil Bayona wrote:
>> On 8/29/2016 7:55 PM, rickman wrote:
>>> On 8/29/2016 4:30 PM, Cecil Bayona wrote:
>>>> Nothing Fancy, that is why in my earlier post I mentioned that I don't
>>>> have a lot of experience. I been working  a 32 bit stack based CPU, but
>>>> it's a work in progress, I'm still sorting it out, it taken less than
>>>> 20% of the chip, but a stack CPU are rather simple compared to other
>>>> CPU's, when finished it should be pretty nice, most instructions take
>>>> one clock to execute, and it used packed instructions, 5
>>>> instructions to
>>>> a word fetch. Originally it was on a Lattice Brevia2, I am now
>>>> converting it to a Artix-7 board, but there is software involved too so
>>>> it's going slow and I'm learning as I go.
>>>
>>> Just a comment on your stack processor.  I've done some design work with
>>> stack processors and read about a lot of designs.  In my humble opinion,
>>> if you have multiple cycle instructions, you are doing it wrong.  I
>>> don't want to steal the thread.  If you care to discuss this we can
>>> start another thread.
>>>
>> I'm not sure why you think that it uses  multiple cock instruction, I
>> mentioned that most occur in one clock, the exception is load immediate
>> it takes the instruction fetch, then a second fetch for the 32 bit value
>> to push on the stack, all others take one clock. Even that one can take
>> place in one clock with extra hardware to fetch RAM with a buffer to
>> hold two 32 bit words, an alternative is to use two clocks one to
>> execute the other to fetch program instructions.
>>
>> What is does have is multiple instructions on one program word, it's
>> five instructions or one depending on what it is, jump, and call take
>> the whole 32 bit word and is not packed, everything else is 5
>> instructions to a 32 bit program word so you have fewer memory fetches.
>
> Are you using external memory?  The CPUs I've designed were 100%
> internal to the FPGA so there was no advantage to fetching multiple
> instructions.  In fact, the multiplexer needed was just more delay.
>
> I thought you design was not one clock per instruction because of what
> you said.  Your design is using a single memory interface.  It is common
> for stack machines to separate data and instruction space.  But if you
> are working out of external memory that is not so easy.
>
> I assume you have read the various similar work that has been done?  The
> J1 is very interesting in that it was so durn simple.  A true MISC
> design and effective.  If you meander over to comp.lang.forth there are
> folks there who have designed some MISC CPUs which have proven their
> worth.  Bernd Paysan designed the B16 which seems like an effective
> design.  I've never tried to work with it, but it sounds similar to
> yours in they way it combines multiple instructions into a single
> instruction word.
>
I think the multiple instructions is to save on program space, 32 bits 
to an instruction will eat up the memory space ways too quickly. It is a 
Von Newman machine with a single address space, code and data is all in 
the same space.

It uses three kinds of instruction formats, short and long, and double word.

A short instruction is 6 bits so you can cram 5 instruction to a memory 
word, so there are 64 possible instructions but the bare machine uses 27 
opcodes so there is room for additional instructions. The are things 
that require no addresses such as dup, swap, return, add, etc.

Long instructions take 30 bits, they are things like jump, call, 
including conditional versions, they include a 24 bit address as part of 
the instruction which can address 16MB of program space, the upper two 
bits are unused at present

There is a single two word instruction, which pushes the second 32 bit 
word into the stack, it itself is a 6 bit instruction and there can be 
more that one "LIT" instruction, each one has an additional word to use 
as a literal. Each 32 bit program word could have 6 LIT instructions 
with each one having an additional 32 bit word following the program 
code word so if you have six literal words packed into single 32 bit 
program word, it is followed by six 32 words containing the literal 
values, the program continues 7 words after the current 6 instruction 
word, of course one can have fewer literals.

The whole thing is like a packed instruction pipeline it minimizes the 
number of fetches but any control flow instruction cancels and dumps the 
current instruction queue and does a fetch at a different place since 
the IP has changed.

The job of packing the instructions into one word is handled by the 
Forth Compiler so the user does not see it, it just makes the program 
code smaller automatically. I did some test and 5 is too long a queue, 
when doing a 16 bit version it packs 3 instructions to a word and that 
version ends up with a shorter program area so it's more efficient, in 
part because Forth changes the Program Counter often so having a longer 
instruction queue waste instruction slots. on a 16 bit version adding 
some extra instructions would be nice so you end up with add with carry, 
subtract with borrow etc so it can handle 32 operations efficiently.

Overall its an efficient design but it could be improved as anything 
else could. One could add combined instructions where a return is 
combined with some regular instructions so it execute both in one clock.

Eventually I want to work with the J1, its a simpler 16 bit machine but 
since the instruction word contains multiple fields it can have 
instructions that do multiple operations at the same time naturally, 
with a more complex compiler packing multiple instructions it might save 
on program space. It does have very limited addressing space due to it's 
instructions limited to 16 bits.
-- 
Cecil - k5nwa

Reply by rickman ●August 30, 20162016-08-30

On 8/30/2016 4:49 AM, Cecil Bayona wrote:
>
>
> On 8/30/2016 1:11 AM, rickman wrote:
>> On 8/30/2016 12:03 AM, Cecil Bayona wrote:
>>> On 8/29/2016 7:55 PM, rickman wrote:
>>>> On 8/29/2016 4:30 PM, Cecil Bayona wrote:
>>>>> Nothing Fancy, that is why in my earlier post I mentioned that I don't
>>>>> have a lot of experience. I been working  a 32 bit stack based CPU,
>>>>> but
>>>>> it's a work in progress, I'm still sorting it out, it taken less than
>>>>> 20% of the chip, but a stack CPU are rather simple compared to other
>>>>> CPU's, when finished it should be pretty nice, most instructions take
>>>>> one clock to execute, and it used packed instructions, 5
>>>>> instructions to
>>>>> a word fetch. Originally it was on a Lattice Brevia2, I am now
>>>>> converting it to a Artix-7 board, but there is software involved
>>>>> too so
>>>>> it's going slow and I'm learning as I go.
>>>>
>>>> Just a comment on your stack processor.  I've done some design work
>>>> with
>>>> stack processors and read about a lot of designs.  In my humble
>>>> opinion,
>>>> if you have multiple cycle instructions, you are doing it wrong.  I
>>>> don't want to steal the thread.  If you care to discuss this we can
>>>> start another thread.
>>>>
>>> I'm not sure why you think that it uses  multiple cock instruction, I
>>> mentioned that most occur in one clock, the exception is load immediate
>>> it takes the instruction fetch, then a second fetch for the 32 bit value
>>> to push on the stack, all others take one clock. Even that one can take
>>> place in one clock with extra hardware to fetch RAM with a buffer to
>>> hold two 32 bit words, an alternative is to use two clocks one to
>>> execute the other to fetch program instructions.
>>>
>>> What is does have is multiple instructions on one program word, it's
>>> five instructions or one depending on what it is, jump, and call take
>>> the whole 32 bit word and is not packed, everything else is 5
>>> instructions to a 32 bit program word so you have fewer memory fetches.
>>
>> Are you using external memory?  The CPUs I've designed were 100%
>> internal to the FPGA so there was no advantage to fetching multiple
>> instructions.  In fact, the multiplexer needed was just more delay.
>>
>> I thought you design was not one clock per instruction because of what
>> you said.  Your design is using a single memory interface.  It is common
>> for stack machines to separate data and instruction space.  But if you
>> are working out of external memory that is not so easy.
>>
>> I assume you have read the various similar work that has been done?  The
>> J1 is very interesting in that it was so durn simple.  A true MISC
>> design and effective.  If you meander over to comp.lang.forth there are
>> folks there who have designed some MISC CPUs which have proven their
>> worth.  Bernd Paysan designed the B16 which seems like an effective
>> design.  I've never tried to work with it, but it sounds similar to
>> yours in they way it combines multiple instructions into a single
>> instruction word.
>>
> I think the multiple instructions is to save on program space, 32 bits
> to an instruction will eat up the memory space ways too quickly. It is a
> Von Newman machine with a single address space, code and data is all in
> the same space.

Again, that is a reflection of having a common address space for 
instructions and data.  My CPUs use 8 or 9 bits for instructions while 
being data size independent.  The instruction format does not imply any 
particular data bus size.


> It uses three kinds of instruction formats, short and long, and double
> word.
>
> A short instruction is 6 bits so you can cram 5 instruction to a memory
> word, so there are 64 possible instructions but the bare machine uses 27
> opcodes so there is room for additional instructions. The are things
> that require no addresses such as dup, swap, return, add, etc.
>
> Long instructions take 30 bits, they are things like jump, call,
> including conditional versions, they include a 24 bit address as part of
> the instruction which can address 16MB of program space, the upper two
> bits are unused at present
>
> There is a single two word instruction, which pushes the second 32 bit
> word into the stack, it itself is a 6 bit instruction and there can be
> more that one "LIT" instruction, each one has an additional word to use
> as a literal. Each 32 bit program word could have 6 LIT instructions
> with each one having an additional 32 bit word following the program
> code word so if you have six literal words packed into single 32 bit
> program word, it is followed by six 32 words containing the literal
> values, the program continues 7 words after the current 6 instruction
> word, of course one can have fewer literals.

I've shied away from multicycle instructions because it means more bits 
(1 bit in this case) to indicate the cycle count which is more input(s) 
to the decoder.  I wanted to try to keep the decoder as simple as possible.


> The whole thing is like a packed instruction pipeline it minimizes the
> number of fetches but any control flow instruction cancels and dumps the
> current instruction queue and does a fetch at a different place since
> the IP has changed.

The F18A does that.  Learning how to pack instructions into the word is 
a bit tricky.  Even harder is learning how to time execution, but that's 
because it is async and does not use a fixed frequency clock.


> The job of packing the instructions into one word is handled by the
> Forth Compiler so the user does not see it, it just makes the program
> code smaller automatically. I did some test and 5 is too long a queue,
> when doing a 16 bit version it packs 3 instructions to a word and that
> version ends up with a shorter program area so it's more efficient, in
> part because Forth changes the Program Counter often so having a longer
> instruction queue waste instruction slots. on a 16 bit version adding
> some extra instructions would be nice so you end up with add with carry,
> subtract with borrow etc so it can handle 32 operations efficiently.
>
> Overall its an efficient design but it could be improved as anything
> else could. One could add combined instructions where a return is
> combined with some regular instructions so it execute both in one clock.

One of the things I have looked at briefly is breaking out the three 
"engines" so each one is separately controlled by fields in the 
instruction.  This will require a larger instruction, but 16 bits should 
be enough.  Then many types of instructions could be combined.  I didn't 
pursue it because I didn't want to work on the assembler that would 
handle it.


> Eventually I want to work with the J1, its a simpler 16 bit machine but
> since the instruction word contains multiple fields it can have
> instructions that do multiple operations at the same time naturally,
> with a more complex compiler packing multiple instructions it might save
> on program space. It does have very limited addressing space due to it's
> instructions limited to 16 bits.

I seem to recall discussing this with someone else not too long ago.  I 
don't think there actually is much parallelism possible with the J1 
design.  You can combine a return instruction with arithmetic 
instructions, and there are fields to adjust the stack independently. 
So you might be able to combine say, 2DUP + in one instruction by using 
+ with a DSTACK +1 instead of a -1.  But the useful combos will be 
limited.  The utility is also limited by the instruction frequency. 
Only 35% of the instructions in the app the J1 was designed for are ALU 
instructions which are the only ones that can be paralleled.

In my CPU design I had separate "engines" for the data stack (ALU), the 
return stack (destination for literals, addresses for all operations and 
looping control) and the instruction fetch.  If each of these had 
separate fields in the instruction word there might be more opportunity 
for parallelism.  But as I said, I didn't pursue this because the 
software support would be messy.  I'd like to get back to that, but it 
won't happen any time soon.

-- 

Rick C

Zero Address CPU logic

Sign in

You might also like...

Search forums

Free PDF Downloads

Blogs - Hall of Fame

Quick Links

About FPGARelated.com

Social Networks

The Related Media Group