FPGARelated.com

6502 FPGA core

Started by Frank Buss May 26, 2007
I've implemented a first version of a 6502 core. It has a very simple
architecture: first the command is read, and then for every command a list
of microcodes is executed, controlled by a state machine. To avoid
redundant VHDL typing, the VHDL code is generated with a Lisp program:

http://www.frank-buss.de/vhdl/cpu.lisp

This is the output:

http://www.frank-buss.de/vhdl/t_rex_test.vhdl

I've tested some instructions, like LDA, and it looks like it works, but I'm
sure there are many bugs, and not all features are implemented (e.g. BCD
mode or interrupt handling). It uses 2,960 LEs with Quartus 7.1, which is
too many compared to the 797 LEs of the T65 project. Any ideas how to
improve it? My hope was that the synthesizer would be able to merge the
addressing mode implementations for the commands, but maybe this has to be
refactored by hand.
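
The LE concern above can be illustrated with a rough count. The following Python sketch is not the Lisp generator from the post; the addressing-mode names, micro-step names and the command subset are all hypothetical. It only compares how many state-machine steps exist when every command carries its own copy of the addressing-mode steps versus when each addressing mode is emitted once and shared.

```python
# Hypothetical addressing-mode micro-step sequences (names are illustrative only).
ADDRESSING_MODES = {
    "immediate": ["fetch_operand"],
    "zeropage": ["fetch_lo", "read_mem"],
    "absolute": ["fetch_lo", "fetch_hi", "read_mem"],
}

# A small, hypothetical subset of commands: (mnemonic, mode, final operation).
COMMANDS = [
    ("LDA", "immediate", "load_a"),
    ("LDA", "zeropage", "load_a"),
    ("LDA", "absolute", "load_a"),
    ("LDX", "immediate", "load_x"),
    ("LDX", "zeropage", "load_x"),
    ("LDX", "absolute", "load_x"),
]


def inlined_steps(commands):
    """Each command gets a private copy of its addressing-mode steps plus its final operation."""
    return sum(len(ADDRESSING_MODES[mode]) + 1 for _, mode, _ in commands)


def shared_steps(commands):
    """Each used addressing mode is emitted once; every command adds only its final operation."""
    used_modes = {mode for _, mode, _ in commands}
    return sum(len(ADDRESSING_MODES[mode]) for mode in used_modes) + len(commands)


print(inlined_steps(COMMANDS), shared_steps(COMMANDS))
```

With more commands per addressing mode the gap widens, which is one reason hand-factoring (or a generator that factors the modes out explicitly) may be needed if the synthesizer does not merge the copies on its own.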

My goal is to beat the T65 project in LE usage. Speed and 100%
compatibility with the original 6502 (e.g. the strange SO pin and V-flag
feature or the original hardware reset vectors) are not important to me,
but code compiled with http://www.cc65.org/ must work.

Most FPGAs have a few kilobytes of on-chip memory (> 5 kByte, even on
inexpensive FPGAs, freely configurable as ROM or RAM), so maybe a good idea
would be to store some microcode in memory? What micro-instruction set would
be useful for implementing the 6502 instruction set? Maybe a Forth-like
microcode?

Any ideas how to improve the Lisp code? I like my idea of using a lambda
function in addressing-commands, because it looks cleaner than a macro,
which I tried first, but I don't like the explicit call of emit-lines.
How can I refactor it into a more DSL-like approach?

-- 
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
Frank Buss wrote:

> I've implemented a first version of a 6502 core. It has a very simple
> architecture: First the command is read and then for every command a list
> of microcodes are executed, controlled by a state machine. To avoid the
> redundant VHDL typing, the VHDL code is generated with a Lisp program:
>
> http://www.frank-buss.de/vhdl/cpu.lisp
>
> This is the output:
>
> http://www.frank-buss.de/vhdl/t_rex_test.vhdl
>
> I've tested some instructions, like LDA, and looks like it works, but I'm
> sure there are many bugs and not all features are implemented (e.g. BCD
> mode or interrupt handling). It uses 2,960 LEs with Quartus 7.1, which is
> too much compared to the 797 LEs of the T65 project. Any ideas how to
> improve it? My idea was, that the synthesizer would be able to merge the
> addressing mode implementations for the commands, but maybe this has to be
> refactored by hand.

That's a lot of ground to make up. Is the 'fat' in any one area?
Does the 797 LE version have BCD and interrupts?

> My goal is to beat the T65 project in LE usage. Speed and 100%
> compatibility with the original 6502 (e.g. the strange S0 and V-flag
> feature or the original hardware reset vectors) is not important for me,
> but code compiled with http://www.cc65.org/ must work.

Err, why not use/improve the T65 work?

-jg

Jim Granville wrote:

> That's a lot of ground to make up. Is the 'fat' in any one area ?
> Does the 797LE version have BCD and Interrupts ?

Looks like it has interrupts, but no BCD.

> Err, why not use/improve the T65 work ?

It's more fun to implement it myself :-)

I've started a new version, see below. Now it is cleaner VHDL code and it
should need very few LEs, but a ROM of maybe 1 kbyte. Every microcode is
executed in one clock cycle. I plan to implement a call/return microcode,
too, with a call stack size of 1 address, to help reduce the ROM size
(e.g. most addressing modes can be implemented as subroutines). For writing
the microcode program and creating the MIF file, I'll write a Lisp program
again.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use work.ALL;

entity t_rex_test is
  port(
    clock_50mhz: in std_logic;
    led: out unsigned(7 downto 0);
    button: in unsigned(3 downto 0);
    dip_switch: in unsigned(3 downto 0);
    neg_reset: in std_logic);
end entity t_rex_test;

architecture rtl of t_rex_test is

  -- bit position in microcode for indicating the last
  -- microcode command in a program
  constant mcode_stop_bit : integer:= 7;

  -- type for CPU addresses
  subtype address_type is std_logic_vector(15 downto 0);

  -- type for CPU data words
  subtype data_type is std_logic_vector(7 downto 0);

  -- microcode commands
  constant mcode_load_pc : data_type := x"01";
  constant mcode_store_address : data_type := x"02";

  -- CPU RAM signals
  signal address : address_type;
  signal data : data_type;
  signal q : data_type;
  signal wren : std_logic := '0';

  -- microcode ROM signals
  signal mcode_address : std_logic_vector(8 downto 0);
  signal mcode_q : data_type;
  signal mcode_code : data_type;
  signal mcode_stop : boolean;

  -- scratch register
  signal working : address_type := x"0200";

  -- current command
  signal command : data_type;

  -- CPU registers
  signal pc : address_type := x"0200";
  signal sp : address_type := x"01ff";
  signal accu : data_type;
  signal x : data_type;
  signal y : data_type;
  signal z_flag : std_logic;
  signal n_flag : std_logic;
  signal c_flag : std_logic;
  signal v_flag : std_logic;
  signal i_flag : std_logic;
  signal d_flag : std_logic;

  -- CPU statemachine
  type cpu_state_type is (
    read_command_state,
    wait_for_read_state,
    read_memory_state,
    wait_for_mcode_index,
    read_mcode_index,
    execute_mcode,
    read_mcode
  );
  signal cpu_state : cpu_state_type := read_command_state;

begin

  -- CPU RAM
  instance_ram: entity ram
    port map (
      address => address(11 downto 0),
      clock => clock_50mhz,
      data => data,
      wren => wren,
      q => q
    );

  -- microcode ROM
  instance_microcode: entity microcode
    port map (
      address => mcode_address,
      clock => clock_50mhz,
      data => x"00",
      wren => '0',
      q => mcode_q
    );

  -- read command and execute microcode
  process(clock_50mhz, neg_reset)
  begin
    if neg_reset = '1' then
      pc <= x"0200";
      sp <= x"01ff";
      accu <= x"00";
      x <= x"00";
      y <= x"00";
      z_flag <= '0';
      n_flag <= '0';
    elsif rising_edge(clock_50mhz) then
      case cpu_state is
        -- read next command
        when read_command_state =>
          address <= pc;
          cpu_state <= wait_for_read_state;
        when wait_for_read_state =>
          cpu_state <= read_memory_state;

        -- use command as index in first 256 ROM bytes
        when read_memory_state =>
          pc <= pc + 1;
          mcode_address <= '0' & q;
          cpu_state <= wait_for_mcode_index;
        when wait_for_mcode_index =>
          cpu_state <= read_mcode_index;

        -- microcode program starts at index + 256
        when read_mcode_index =>
          mcode_address <= '1' & mcode_q;
          cpu_state <= execute_mcode;
        when execute_mcode =>
          mcode_address <= mcode_address + 1;
          cpu_state <= read_mcode;

        -- execute microcode program
        when read_mcode =>
          mcode_address <= mcode_address + 1;
          case mcode_code is
            -- copy pc to working register
            when mcode_load_pc =>
              working <= pc;
            -- store working register to RAM address register
            when mcode_store_address =>
              address <= working;
            -- unknown microcommand, read next command
            when others =>
              cpu_state <= read_command_state;
          end case;
          -- if mcode_stop_bit was set, read next command
          if mcode_stop then
            cpu_state <= read_command_state;
          end if;
      end case;
    end if;
  end process;

  mcode_code <= mcode_q and "01111111";
  mcode_stop <= true when mcode_q(mcode_stop_bit) = '1' else false;

end architecture rtl;

-- 
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

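
The two-level ROM lookup in the VHDL above (command byte indexes the first 256 ROM bytes, the value found there selects a program in the second 256 bytes, bit 7 marks the last microcode) can be modelled in a few lines of Python. This is a sketch for checking a ROM image by hand, not a cycle-accurate model: the wait states are ignored, `build_rom` and `microcodes_for` are hypothetical helper names, and the mapping of 6502 opcodes to microcode programs in the usage example is made up.

```python
MCODE_STOP_BIT = 7           # same bit position as mcode_stop_bit in the VHDL
MCODE_LOAD_PC = 0x01         # values taken from the VHDL constants
MCODE_STORE_ADDRESS = 0x02


def build_rom(programs):
    """Build a 512-byte ROM image.

    programs maps a command byte to a list of 7-bit microcodes.
    Bytes 0..255 hold the program offset for each command, bytes 256..511
    hold the programs, with bit 7 set on the last microcode of each one.
    Commands that are not listed keep offset 0, i.e. the first program.
    """
    rom = [0] * 512
    offset = 0
    for command, codes in programs.items():
        rom[command] = offset
        for i, code in enumerate(codes):
            is_last = i == len(codes) - 1
            rom[256 + offset + i] = code | ((1 << MCODE_STOP_BIT) if is_last else 0)
        offset += len(codes)
    return rom


def microcodes_for(rom, command):
    """Walk the ROM the way the state machine does and return the decoded microcodes."""
    address = 256 + rom[command]          # corresponds to '1' & mcode_q
    codes = []
    while True:
        byte = rom[address]
        codes.append(byte & 0x7F)         # corresponds to mcode_q and "01111111"
        if byte & (1 << MCODE_STOP_BIT):  # stop bit set: program ends here
            return codes
        address += 1


# Hypothetical assignment of microcode programs to two command bytes.
rom = build_rom({0xA9: [MCODE_LOAD_PC, MCODE_STORE_ADDRESS], 0xEA: [MCODE_LOAD_PC]})
print(microcodes_for(rom, 0xA9))
```

Because the index table stores only an 8-bit offset, all programs together have to fit into 256 bytes in this layout, which is where the planned call/return microcode would help.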
Frank Buss wrote:

> Jim Granville wrote:
>
>> That's a lot of ground to make up. Is the 'fat' in any one area ?
>> Does the 797LE version have BCD and Interrupts ?
>
> Looks like it has interrupts, but no BCD.
>
>> Err, why not use/improve the T65 work ?
>
> It's more fun to implement it myself :-)

OK.

> I've started a new version, see below. Now it is more clean VHDL code and
> it should need very few LEs, but a ROM of maybe 1 kbyte. Every microcode is
> executed in one clock cycle.

ROM makes sense; pretty much every FPGA these days has these for free,
and they should be used more in soft CPU designs.

> I plan to implement a call/return microcode,
> with a callstack size of 1 address, too, for helping to reduce the ROM size
> (e.g. most addressing modes can be implemented in subroutines). For writing
> the microcode program and creating the MIF file, I'll write a Lisp program
> again.

Let us know how the different approach impacts LE count.

-jg

Nice work Frank!

I haven't looked in detail at your work, but the general idea of doing
nostalgic implementations, doing it for fun and doing it in a
minimum-resource fashion is my cup of tea. Just one suggestion, which is
something that I'm working on right now (or... one of the things I'm
working on right now). If you are willing to sacrifice a lot of FMAX to
save FPGA resources, maybe an inner CPU with a very simple instruction set
could do? By doing this and building the instruction set from reusable
pieces, I think there are potential resource gains to be had. And if you
remember the speeds these hogs were running at in the wild days (1, 2, 4,
8 MHz), maybe similar performance is still OK.

OK, you won't win prizes in minimum power usage, in readability or
probably in anything else, but I have a hunch this is the way to achieve
MAXIMUM usage of resources. I myself have been working on implementing a
minimal 68K core this way. Still a lot of work left to do, but the current
reading of 8% of a Spartan3-200K is quite nice. My goal is to, at least,
get something working in about 20% of a Spartan3-200K.

/Magnus

spartan3wiz wrote:

> If you are willing to sacrifice ALOT FMAX to save FPGA-resources
> maybe an inner CPU with a very simple instructions-set could do?

Yes, this was my idea. I have enhanced my FPGA implementation to a
Forth-like CPU; this is the current version:

http://www.frank-buss.de/vhdl/t_rex_test2.vhdl

It has the following microcodes:

call
load-pc
load-address
load-q
load-accu
load-x
load-y
store-pc
store-address
store-data
store-accu
store-x
store-y
inc
dec
add
lshift-8
or
nop

The load and store commands transfer values between the specified register
and an internal stack (stack size is configurable): as the trace below
shows, a load pushes the register onto the stack and a store pops the stack
into the register. "call" executes a program at the location specified in
the next byte. A return is performed if bit 7 is set in the microcode. The
rest are instructions needed to make it simpler to implement the 6502
instruction set, e.g. "or" pops the first two values from the stack, does a
binary OR and pushes the result back onto the stack.

Testing the microcodes with a simulator is too time-consuming, so I've
implemented an emulator in Lisp, which also creates the opcode ROM and the
constant list for the microcodes for pasting into the VHDL code:

http://www.frank-buss.de/vhdl/cpu2.lisp

Playing with it is really nice, e.g. this is the output of an interactive
session:

CL-USER > (dump #x1f00)

1F00: 00 00 00 00 00 00 00 00
NIL

CL-USER > (execute-command)
current registers:
a: 00, x: 00, y: 00
pc: 0200, mcode-address: 0000,

executing microcode: CALL
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 0119,

executing microcode: LOAD-PC
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011A,

executing microcode: STORE-ADDRESS
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011B,

executing microcode: LOAD-PC
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011C,

executing microcode: INC
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011D,

executing microcode: STORE-PC
a: 00, x: 00, y: 00
pc: 0202, mcode-address: 011E,

executing microcode: LOAD-Q
a: 00, x: 00, y: 00
pc: 0202, mcode-address: 012B,

executing microcode: STORE-ACCU
a: 2A, x: 00, y: 00
pc: 0202, mcode-address: 012C,

NIL

CL-USER > (execute-command)
current registers:
a: 2A, x: 00, y: 00
pc: 0202, mcode-address: 012C,

executing microcode: CALL
a: 2A, x: 00, y: 00
pc: 0203, mcode-address: 010C,

executing microcode: CALL
a: 2A, x: 00, y: 00
pc: 0203, mcode-address: 0106,

...

NIL

CL-USER > (dump #x1f00)

1F00: 2A 00 00 00 00 00 00 00
NIL

This was the execution of the following small program, compiled with cc65:

.org $200
lda #42
sta $1f00

Now I can implement and test the rest very fast, because I can add
debugging output very easily, implement all addressing modes of one
instruction with a higher-level instruction to avoid (Lisp) code
duplication etc.

On the FPGA side I only have to simulate the microcodes. If every
microcode does what it should do, then all microcode programs should work
immediately, because they were tested with my Lisp program before.

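
For readers without a Lisp environment, the stack-based microcode idea can be tried out with a small Python sketch. It is not a port of cpu2.lisp: the class name `MicroMachine` is hypothetical, microcodes are plain strings instead of ROM bytes, call/return and the bit-7 encoding are left out, and memory is a dictionary read at the moment load-q executes. The microcode sequence is the one visible in the first trace above (from LOAD-PC to STORE-ACCU), which is assumed here to be the immediate-mode load into the accumulator.

```python
class MicroMachine:
    """Tiny model of a Forth-like microcode engine with an internal stack."""

    def __init__(self, memory, pc):
        self.memory = memory      # dict: address -> byte
        self.pc = pc
        self.address = 0          # RAM address register
        self.accu = 0
        self.stack = []

    def step(self, microcode):
        if microcode == "load-pc":            # push pc
            self.stack.append(self.pc)
        elif microcode == "load-q":           # push the byte at the address register
            self.stack.append(self.memory.get(self.address, 0))
        elif microcode == "store-address":    # pop into the address register
            self.address = self.stack.pop()
        elif microcode == "store-pc":         # pop into pc
            self.pc = self.stack.pop()
        elif microcode == "store-accu":       # pop into the accumulator
            self.accu = self.stack.pop() & 0xFF
        elif microcode == "inc":              # increment top of stack (16-bit wrap)
            self.stack.append((self.stack.pop() + 1) & 0xFFFF)
        elif microcode == "or":               # pop two values, push their bitwise OR
            b, a = self.stack.pop(), self.stack.pop()
            self.stack.append(a | b)
        else:
            raise ValueError("unknown microcode: %s" % microcode)

    def run(self, program):
        for microcode in program:
            self.step(microcode)


# Sequence taken from the trace above; pc already points at the operand byte.
LDA_IMMEDIATE = ["load-pc", "store-address", "load-pc", "inc",
                 "store-pc", "load-q", "store-accu"]

machine = MicroMachine({0x0201: 0x2A}, pc=0x0201)
machine.run(LDA_IMMEDIATE)
print("a: %02X, pc: %04X" % (machine.accu, machine.pc))
```

The stack is empty again after the sequence, which is a cheap sanity check worth running on every microcode program before it goes into the ROM.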
> OK, you won't win prices in minimum-power-usage, in readability or
> probably in anything but I have a hunch this is the way to achieve
> MAXIMUM usage of resources.

I think readability is very good (OK, maybe because I know Forth and Lisp),
and power usage should be good, too, because fewer LEs are used. My current
Forth FPGA implementation needs 319 LEs (about 5% of the small Cyclone
EP1C6Q240C8). But I expect it to be 10 times slower than e.g. the T65, so
all in all the cycles per unit of power would not be so good.

> Me, myself have been working on
> implementing a minimum 68K-core this way. Still alot of work left
> todo, but the current reading of 8% of a Spartan3-200K is quite nice..
> My goal is to, at least, get something working in about 20% of a
> Spartan3-200K.

Nice. What does your microcode look like? Some instructions are very
similar to the 6502, so maybe we can develop the perfect microcode for both
:-)

-- 
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de

Frank Buss <fb@frank-buss.de> writes:

> I've implemented a first version of a 6502 core. It has a very simple
> architecture: First the command is read and then for every command a list
> of microcodes are executed, controlled by a state machine. To avoid the
> redundant VHDL typing, the VHDL code is generated with a Lisp program:

[...]

> Any ideas how to improve the Lisp code? I like my idea of using a lambda
> function in addressing-commands, because this looks more clean than a
> macro, which I've tried first, but I don't like the explicit call of
> emit-lines. How can I refactor it to a more DSL like approach?

(Followup was set to comp.arch.fpga, but since this is Lisp-related,
I've changed followup to c.l.l.)

It seems to me that a more natural way to represent this code in a
Lisp program would be to use some form of syntax trees. For example,
the VHDL statement

if q = x"00" then
  z_flag <= '1';
else
  z_flag <= '0';
end if;

could be represented with this tree:

(if (= q #x00)
    (<= z_flag #\1)
    (<= z_flag #\0))

Producing VHDL code from the tree representation should be
straight-forward. Peter Seibel's book has a chapter on HTML-generation
which might be helpful. You might also want to look up "abstract
syntax" or "syntax trees" in any compiler textbook.

Sven-Olof Nyström

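
The syntax-tree suggestion can be sketched outside Lisp as well. The Python below represents the tree from the post as nested tuples and walks it to produce the VHDL lines; `emit_expr` and `emit_stmt` are hypothetical names, only the two statement forms from the example are handled, and the conventions (integers become byte literals, the strings "0" and "1" become std_logic literals, any other string is an identifier) are assumptions made for this illustration.

```python
def emit_expr(node):
    """Render an expression node as VHDL text."""
    if isinstance(node, int):
        return 'x"%02X"' % node           # integer -> byte literal
    if isinstance(node, str):
        if node in ("0", "1"):
            return "'%s'" % node          # one-bit std_logic literal
        return node                       # anything else is an identifier
    op, left, right = node                # binary operator node
    return "%s %s %s" % (emit_expr(left), op, emit_expr(right))


def emit_stmt(node, indent=0):
    """Render a statement node as a list of VHDL source lines."""
    pad = "  " * indent
    kind = node[0]
    if kind == "<=":                      # signal assignment
        _, target, value = node
        return ["%s%s <= %s;" % (pad, target, emit_expr(value))]
    if kind == "if":                      # if / else with one statement per branch
        _, condition, then_branch, else_branch = node
        return ([pad + "if %s then" % emit_expr(condition)]
                + emit_stmt(then_branch, indent + 1)
                + [pad + "else"]
                + emit_stmt(else_branch, indent + 1)
                + [pad + "end if;"])
    raise ValueError("unknown statement: %r" % (kind,))


# The tree from the post, written as nested tuples.
TREE = ("if", ("=", "q", 0x00),
        ("<=", "z_flag", "1"),
        ("<=", "z_flag", "0"))

print("\n".join(emit_stmt(TREE)))
```

The point of the design is that the generator code only builds trees; indentation, semicolons and keywords live in one emitter, so explicit emit-lines style calls disappear from the instruction definitions.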
Ahh nice! I think we have connected brains! :-)

Actually my idea isn't that refined yet. I am taking the long way around
the problem. After seeing some of the different retro cores out there
(6502, 8051, Z80 and 68K), which I think took just too many resources from
my poor small FPGA, I decided that it must be possible to implement
something smaller and still have full functionality. The people who have
implemented these cores straight in VHDL/Verilog are fantastic people, way
beyond my current limit I think, so all cheers to you!

If I/we don't need the 50 MHz (or so) that can be achieved on today's
standard hobby FPGA development boards, maybe the FMAX CAN be, in some
way, converted into reused blocks.

I started my 68K project by using the fantastic, already super-optimized
PicoBlaze by Ken Chapman, which is Xilinx-specific and SMALL:

http://en.wikipedia.org/wiki/PicoBlaze
http://www.xilinx.com/ipcenter/processor_central/picoblaze/picoblaze_user_resources.htm

By switching over to, for example, PacoBlaze:
http://bleyer.org/pacoblaze/

I can then make it run on anything and still be quite small!

My work is only half finished and I'm not sure it'll fit into the tight
program space, but by taking on hard projects at least you learn
something, I think. To fit this into the program space of the PicoBlaze
I need to do MAXIMUM reuse of the assembler code, thus crystallizing
out the nice reusable parts into assembler subroutines. Maybe even
finding reuse in places where it otherwise might be missed.

By using an 8-bit CPU to emulate a 16-bit CPU I can save resources
but take a hard performance hit. It is very time-consuming doing these
tests and there is lots of stuff left to fix, for example the memory
access problem, but I'll keep trying until... well, until I don't feel
like it! :-)
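
The performance hit mentioned above comes from splitting every wide operation into byte-sized steps. As a minimal illustration (in Python, not PicoBlaze assembler, and with hypothetical function names), a 16-bit addition on an 8-bit datapath takes one add for the low bytes and one add-with-carry for the high bytes:

```python
def add8(a, b, carry_in=0):
    """One 8-bit add as an 8-bit ALU would do it: returns (result byte, carry out)."""
    total = a + b + carry_in
    return total & 0xFF, total >> 8


def add16_on_8bit(x, y):
    """16-bit addition built from two 8-bit adds, low byte first."""
    lo, carry = add8(x & 0xFF, y & 0xFF)
    hi, carry = add8((x >> 8) & 0xFF, (y >> 8) & 0xFF, carry)
    return (hi << 8) | lo, carry


print(add16_on_8bit(0x12FF, 0x0001))
```

Wider operands simply repeat the add-with-carry step once per extra byte, so the cycle count grows with the operand width while the datapath stays 8 bits wide.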

Then when I'm finished (and have something working), I have several
possible next steps, all of which I would like to try.

1) Just keep the slow small core, making sure it runs on PicoBlaze
(Xilinx hardware) as well as PacoBlaze (anything..) and does its job.

2) Try removing all unused instructions from the 8-bit CPU's
instruction set, thus making it even smaller BUT ruling out possible
future upgrades and removing the compatibility of the internal parts
(this would only be published as already compiled BIT files, I think..)

3) Try adding extra instructions (by turning the assembler subroutines
into new instructions), guided by profiling solution 1) while it runs.
By doing this we can find the perfect balance (or several balances)
between size and speed, depending on the demands of the target circuit.

4) A combination of 2) and 3)

5) Maybe building something completely new out of the things learned
from all the above..
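
Step 3) depends on knowing which subroutines are called most often. A minimal Python sketch of that profiling step, assuming the emulator can log the name of every subroutine it calls (the trace and the subroutine names below are invented for illustration, not measured data):

```python
from collections import Counter


def rank_subroutines(call_trace, top=3):
    """Return the most frequently called subroutines as (name, count) pairs."""
    return Counter(call_trace).most_common(top)


# Hypothetical call log from running the emulated core.
trace = ["fetch_word", "add16", "fetch_word", "set_flags", "fetch_word", "add16"]
print(rank_subroutines(trace, top=2))
```

The subroutines at the top of such a ranking are the candidates for becoming new instructions; the ones that never appear are candidates for removal under step 2).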

But your thoughts on doing something generic just sounds NICE!

A nice tool that kept me going this far is:
http://www.mediatronix.com/pBlazeIDE.htm

/Magnus

Frank Buss wrote:
<snip>
> On the FPGA side I have to simulate the microcodes, only. If every
> microcode does what it should do, then all microcode programs should work
> immediatly, because they were tested with my Lisp prorgam before.

Just checking if you have seen the work of Jan Decaluwe?
http://myhdl.jandecaluwe.com/doku.php/start

>> OK, you won't win prices in minimum-power-usage, in readability or
>> probably in anything but I have a hunch this is the way to achieve
>> MAXIMUM usage of resources.
>
> I think readability is very good (ok, maybe because I know Forth and Lisp)
> and power-usage should be good, too, because fewer LEs are used. My current
> Forth FPGA implementation needs 319 LEs (about 5% of the small Cyclone
> EP1C6Q240C8). But I expect 10 times slower than e.g. the T65, so the all in
> all cycles per power would be not so good.

If this runs slower, one of my pet ideas for FPGA cores is to design
them to run from SerialFLASH memory. Top-end ones (Winbond) run at
150 MBd of link speed, so they can feed nearly 20 MB/s of streaming code.
Ideally, the core has a short-skip opcode, as a jump in such memory
has a higher cost.

-jg

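
The figures in the serial flash idea follow from simple arithmetic, sketched below in Python. The link is assumed to carry one bit per baud, and the jump cost assumes a read command of 8 command bits plus 24 address bits, which is common for SPI NOR flash but is an assumption here, not something stated in the post (fast-read modes with dummy bytes or multi-bit outputs would change the numbers).

```python
def streaming_bytes_per_second(link_baud, bits_per_byte=8):
    """Sustained code bandwidth of a one-bit-wide serial link."""
    return link_baud / bits_per_byte


def jump_cost_bytes(command_bits, address_bits, bits_per_byte=8):
    """Link time spent re-addressing the flash, expressed in bytes that could have been streamed."""
    return (command_bits + address_bits) / bits_per_byte


print(streaming_bytes_per_second(150e6))   # 18750000.0, i.e. "nearly 20 MB/s"
print(jump_cost_bytes(8, 24))              # 4.0 byte times per jump (assumed command format)
```

Under these assumptions, skipping forward by fewer than about four bytes is cheaper when the core just streams through and discards the skipped bytes, which is the case a short-skip opcode would cover.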
On Mon, 28 May 2007 07:34:24 +1200, Jim Granville
<no.spam@designtools.maps.co.nz> wrote:

>Frank Buss wrote:
><snip>
>
>If this runs slower, one of my pet ideas for FPGA cores, is to design
>them to run from SerialFLASH memory. Top end ones (winbond) run at
>150MBd of link speed, so can feed nearly 20MB/s of streaming code.
>Ideally, the core has a short-skip opcode, as the jump in such memory
>has a higher cost.

Or a "four address instruction" like the Pilot Ace, with SerialFlash in
place of a tube full of mercury?

- Brian