I've implemented a first version of a 6502 core. It has a very simple architecture: first the command is read, and then for every command a list of microcodes is executed, controlled by a state machine. To avoid the redundant VHDL typing, the VHDL code is generated by a Lisp program:

http://www.frank-buss.de/vhdl/cpu.lisp

This is the output:

http://www.frank-buss.de/vhdl/t_rex_test.vhdl

I've tested some instructions, like LDA, and it looks like it works, but I'm sure there are many bugs, and not all features are implemented (e.g. BCD mode and interrupt handling). It uses 2,960 LEs with Quartus 7.1, which is too much compared to the 797 LEs of the T65 project. Any ideas how to improve it? I had hoped that the synthesizer would be able to merge the addressing mode implementations for the commands, but maybe this has to be refactored by hand.

My goal is to beat the T65 project in LE usage. Speed and 100% compatibility with the original 6502 (e.g. the strange SO pin and V-flag feature or the original hardware reset vectors) are not important to me, but code compiled with http://www.cc65.org/ must work.

Most FPGAs have a few kilobytes of on-chip memory (more than 5 kByte even on inexpensive FPGAs, freely configurable as ROM or RAM), so maybe it would be a good idea to store some microcode in memory. What microcode instruction set would be useful for implementing the 6502 instruction set? Maybe a Forth-like microcode?

Any ideas how to improve the Lisp code? I like my idea of using a lambda function in addressing-commands, because it looks cleaner than the macro I tried first, but I don't like the explicit call of emit-lines. How can I refactor it to a more DSL-like approach?

-- 
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
6502 FPGA core
Started by ●May 26, 2007
Reply by ●May 26, 2007
Frank Buss wrote:
> I've implemented a first version of a 6502 core. It has a very simple
> architecture: first the command is read, and then for every command a list
> of microcodes is executed, controlled by a state machine. To avoid the
> redundant VHDL typing, the VHDL code is generated by a Lisp program:
>
> http://www.frank-buss.de/vhdl/cpu.lisp
>
> This is the output:
>
> http://www.frank-buss.de/vhdl/t_rex_test.vhdl
>
> I've tested some instructions, like LDA, and it looks like it works, but I'm
> sure there are many bugs, and not all features are implemented (e.g. BCD
> mode and interrupt handling). It uses 2,960 LEs with Quartus 7.1, which is
> too much compared to the 797 LEs of the T65 project. Any ideas how to
> improve it? I had hoped that the synthesizer would be able to merge the
> addressing mode implementations for the commands, but maybe this has to be
> refactored by hand.

That's a lot of ground to make up. Is the 'fat' in any one area? Does the 797 LE version have BCD and interrupts?

> My goal is to beat the T65 project in LE usage. Speed and 100%
> compatibility with the original 6502 (e.g. the strange SO pin and V-flag
> feature or the original hardware reset vectors) are not important to me,
> but code compiled with http://www.cc65.org/ must work.

Err, why not use/improve the T65 work?

-jg
Reply by ●May 26, 2007
Jim Granville wrote:
> That's a lot of ground to make up. Is the 'fat' in any one area?
> Does the 797 LE version have BCD and interrupts?

Looks like it has interrupts, but no BCD.

> Err, why not use/improve the T65 work?

It's more fun to implement it myself :-)

I've started a new version, see below. The VHDL code is now cleaner and it should need very few LEs, plus a ROM of maybe 1 kbyte. Every microcode is executed in one clock cycle. I also plan to implement a call/return microcode with a call stack depth of one address, to help reduce the ROM size (e.g. most addressing modes can be implemented as subroutines). For writing the microcode program and creating the MIF file, I'll write a Lisp program again.

library IEEE;
use IEEE.STD_LOGIC_1164.ALL;
use IEEE.STD_LOGIC_ARITH.ALL;
use IEEE.STD_LOGIC_UNSIGNED.ALL;
use work.ALL;

entity t_rex_test is
  port(
    clock_50mhz: in std_logic;
    led: out unsigned(7 downto 0);
    button: in unsigned(3 downto 0);
    dip_switch: in unsigned(3 downto 0);
    neg_reset: in std_logic);
end entity t_rex_test;

architecture rtl of t_rex_test is

  -- bit position in microcode for indicating the last
  -- microcode command in a program
  constant mcode_stop_bit : integer := 7;

  -- type for CPU addresses
  subtype address_type is std_logic_vector(15 downto 0);

  -- type for CPU data words
  subtype data_type is std_logic_vector(7 downto 0);

  -- microcode commands
  constant mcode_load_pc : data_type := x"01";
  constant mcode_store_address : data_type := x"02";

  -- CPU RAM signals
  signal address : address_type;
  signal data : data_type;
  signal q : data_type;
  signal wren : std_logic := '0';

  -- microcode ROM signals
  signal mcode_address : std_logic_vector(8 downto 0);
  signal mcode_q : data_type;
  signal mcode_code : data_type;
  signal mcode_stop : boolean;

  -- scratch register
  signal working : address_type := x"0200";

  -- current command
  signal command : data_type;

  -- CPU registers
  signal pc : address_type := x"0200";
  signal sp : address_type := x"01ff";
  signal accu : data_type;
  signal x : data_type;
  signal y : data_type;
  signal z_flag : std_logic;
  signal n_flag : std_logic;
  signal c_flag : std_logic;
  signal v_flag : std_logic;
  signal i_flag : std_logic;
  signal d_flag : std_logic;

  -- CPU state machine
  type cpu_state_type is (
    read_command_state,
    wait_for_read_state,
    read_memory_state,
    wait_for_mcode_index,
    read_mcode_index,
    execute_mcode,
    read_mcode
  );
  signal cpu_state : cpu_state_type := read_command_state;

begin

  -- CPU RAM
  instance_ram: entity ram
    port map (
      address => address(11 downto 0),
      clock => clock_50mhz,
      data => data,
      wren => wren,
      q => q
    );

  -- microcode ROM
  instance_microcode: entity microcode
    port map (
      address => mcode_address,
      clock => clock_50mhz,
      data => x"00",
      wren => '0',
      q => mcode_q
    );

  -- read command and execute microcode
  process(clock_50mhz, neg_reset)
  begin
    if neg_reset = '1' then
      pc <= x"0200";
      sp <= x"01ff";
      accu <= x"00";
      x <= x"00";
      y <= x"00";
      z_flag <= '0';
      n_flag <= '0';
    elsif rising_edge(clock_50mhz) then
      case cpu_state is
        -- read next command
        when read_command_state =>
          address <= pc;
          cpu_state <= wait_for_read_state;
        when wait_for_read_state =>
          cpu_state <= read_memory_state;

        -- use command as index in first 256 ROM bytes
        when read_memory_state =>
          pc <= pc + 1;
          mcode_address <= '0' & q;
          cpu_state <= wait_for_mcode_index;
        when wait_for_mcode_index =>
          cpu_state <= read_mcode_index;

        -- microcode program starts at index + 256
        when read_mcode_index =>
          mcode_address <= '1' & mcode_q;
          cpu_state <= execute_mcode;
        when execute_mcode =>
          mcode_address <= mcode_address + 1;
          cpu_state <= read_mcode;

        -- execute microcode program
        when read_mcode =>
          mcode_address <= mcode_address + 1;
          case mcode_code is
            -- copy pc to working register
            when mcode_load_pc =>
              working <= pc;
            -- store working register to RAM address register
            when mcode_store_address =>
              address <= working;
            -- unknown microcommand, read next command
            when others =>
              cpu_state <= read_command_state;
          end case;

          -- if mcode_stop_bit was set, read next command
          if mcode_stop then
            cpu_state <= read_command_state;
          end if;
      end case;
    end if;
  end process;

  mcode_code <= mcode_q and "01111111";
  mcode_stop <= true when mcode_q(mcode_stop_bit) = '1' else false;

end architecture rtl;

-- 
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
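The two-level lookup in the VHDL above (the opcode indexes the first 256 ROM bytes, the byte found there selects a program starting at index + 256, and bit 7 marks the last microcode) may be easier to follow as a small software model. The following Python sketch is only an illustration under stated assumptions: the ROM contents, the entry for opcode 0xA9 and the helper names `build_rom` and `step` are made up, and the wait states of the real state machine are ignored.

```python
# Software model of the two-level microcode lookup (illustrative sketch).
STOP_BIT = 0x80             # bit 7: last microcode of a program
MCODE_LOAD_PC = 0x01        # copy pc to the working register
MCODE_STORE_ADDRESS = 0x02  # copy working register to the RAM address register


def build_rom():
    """Build a 512-byte ROM: 256 index bytes followed by microcode programs."""
    rom = [0] * 512
    # Hypothetical entry: opcode 0xA9 uses the program at offset 0x10.
    rom[0xA9] = 0x10
    rom[256 + 0x10] = MCODE_LOAD_PC
    rom[256 + 0x11] = MCODE_STORE_ADDRESS | STOP_BIT
    return rom


def step(rom, ram, state):
    """Fetch one opcode and run its microcode; return the codes executed."""
    opcode = ram[state["pc"] & 0x0FFF]        # the RAM uses address(11 downto 0)
    state["pc"] = (state["pc"] + 1) & 0xFFFF
    mcode_address = 256 + rom[opcode]         # '1' & mcode_q in the VHDL
    executed = []
    while True:
        byte = rom[mcode_address]
        code = byte & 0x7F                    # mcode_q and "01111111"
        if code == MCODE_LOAD_PC:
            state["working"] = state["pc"]
        elif code == MCODE_STORE_ADDRESS:
            state["address"] = state["working"]
        else:
            break                             # unknown microcode: next command
        executed.append(code)
        if byte & STOP_BIT:
            break                             # stop bit set: next command
        mcode_address += 1
    return executed


rom = build_rom()
ram = [0] * 4096
ram[0x200] = 0xA9
state = {"pc": 0x200, "working": 0, "address": 0}
executed = step(rom, ram, state)
print(executed, hex(state["address"]))
```

Because each opcode costs only one index byte, several opcodes can point at the same program, which is one way the ROM could stay near the 1 kbyte mentioned in the post.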
Reply by ●May 27, 2007
Frank Buss wrote:
> Jim Granville wrote:
>
>>That's a lot of ground to make up. Is the 'fat' in any one area?
>>Does the 797 LE version have BCD and interrupts?
>
> Looks like it has interrupts, but no BCD.
>
>>Err, why not use/improve the T65 work?
>
> It's more fun to implement it myself :-)

OK.

> I've started a new version, see below. The VHDL code is now cleaner and
> it should need very few LEs, plus a ROM of maybe 1 kbyte. Every microcode is
> executed in one clock cycle.

ROM makes sense. Pretty much every FPGA these days has block memory for free, and it should be used more in soft CPU designs.

> I also plan to implement a call/return microcode with a call stack depth
> of one address, to help reduce the ROM size (e.g. most addressing modes
> can be implemented as subroutines). For writing the microcode program and
> creating the MIF file, I'll write a Lisp program again.

Let us know how the different approach impacts LE count.

-jg
Reply by ●May 27, 2007
Nice work Frank!

I haven't looked at your work in detail, but the general idea of doing nostalgic implementations, doing it for fun and doing it with minimum resources is my cup of tea.

Just one suggestion, which is something I'm working on right now (or... one of the things I'm working on right now). If you are willing to sacrifice a LOT of FMAX to save FPGA resources, maybe an inner CPU with a very simple instruction set could do? By doing this and building the instruction set from reusable pieces, I think there are potential resource gains to be had. And if you remember the speeds these hogs were running at in the wild days (1, 2, 4, 8 MHz), maybe similar performance is still OK.

OK, you won't win prizes in minimum power usage, in readability or probably in anything, but I have a hunch this is the way to achieve MAXIMUM usage of resources. I myself have been working on implementing a minimum 68K core this way. There is still a lot of work left to do, but the current reading of 8% of a Spartan3-200K is quite nice. My goal is to, at least, get something working in about 20% of a Spartan3-200K.

/Magnus

On May 27, 7:20 am, Jim Granville <no.s...@designtools.maps.co.nz> wrote:
<snip>
Reply by ●May 27, 2007
spartan3wiz wrote:
> If you are willing to sacrifice a LOT of FMAX to save FPGA resources,
> maybe an inner CPU with a very simple instruction set could do?

Yes, this was my idea. I have enhanced my FPGA implementation to a Forth-like CPU. This is the current version:

http://www.frank-buss.de/vhdl/t_rex_test2.vhdl

It has the following microcodes:

call
load-pc
load-address
load-q
load-accu
load-x
load-y
store-pc
store-address
store-data
store-accu
store-x
store-y
inc
dec
add
lshift-8
or
nop

The load commands push the specified register onto an internal stack, and the store commands pop the top of the stack into the specified register (the stack size is configurable). "call" executes a program at the location specified in the next byte. A return is performed when bit 7 is set in a microcode. The rest are instructions that make it simpler to implement the 6502 instruction set, e.g. "or" pops the top two values from the stack, does a binary OR and pushes the result back onto the stack.

Testing the microcodes with a simulator is too time-consuming, so I've implemented an emulator in Lisp, which also creates the opcode ROM and the constant list for the microcodes for pasting into the VHDL code:

http://www.frank-buss.de/vhdl/cpu2.lisp

Playing with it is really nice, e.g. this is the output of an interactive session:

CL-USER > (dump #x1f00)

1F00: 00 00 00 00 00 00 00 00
NIL

CL-USER > (execute-command)
current registers:
a: 00, x: 00, y: 00
pc: 0200, mcode-address: 0000,

executing microcode: CALL
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 0119,

executing microcode: LOAD-PC
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011A,

executing microcode: STORE-ADDRESS
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011B,

executing microcode: LOAD-PC
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011C,

executing microcode: INC
a: 00, x: 00, y: 00
pc: 0201, mcode-address: 011D,

executing microcode: STORE-PC
a: 00, x: 00, y: 00
pc: 0202, mcode-address: 011E,

executing microcode: LOAD-Q
a: 00, x: 00, y: 00
pc: 0202, mcode-address: 012B,

executing microcode: STORE-ACCU
a: 2A, x: 00, y: 00
pc: 0202, mcode-address: 012C,

NIL

CL-USER > (execute-command)
current registers:
a: 2A, x: 00, y: 00
pc: 0202, mcode-address: 012C,

executing microcode: CALL
a: 2A, x: 00, y: 00
pc: 0203, mcode-address: 010C,

executing microcode: CALL
a: 2A, x: 00, y: 00
pc: 0203, mcode-address: 0106,

...

NIL

CL-USER > (dump #x1f00)

1F00: 2A 00 00 00 00 00 00 00
NIL

This was the execution of the following small program, assembled with cc65:

.org $200
lda #42
sta $1f00

Now I can implement and test the rest very fast, because I can add debugging output very easily, implement all addressing modes of one instruction with a higher-level instruction to avoid (Lisp) code duplication, etc.

On the FPGA side I only have to simulate the individual microcodes. If every microcode does what it should do, then all microcode programs should work immediately, because they were tested with my Lisp program before.

> OK, you won't win prizes in minimum power usage, in readability or
> probably in anything, but I have a hunch this is the way to achieve
> MAXIMUM usage of resources.

I think readability is very good (OK, maybe because I know Forth and Lisp), and power usage should be good too, because fewer LEs are used. My current Forth FPGA implementation needs 319 LEs (about 5% of the small Cyclone EP1C6Q240C8). But I expect it to be about 10 times slower than e.g. the T65, so overall the work done per unit of energy would not be so good.

> I myself have been working on implementing a minimum 68K core this way.
> There is still a lot of work left to do, but the current reading of 8% of
> a Spartan3-200K is quite nice. My goal is to, at least, get something
> working in about 20% of a Spartan3-200K.

Nice. What does your microcode look like? Some instructions are very similar to the 6502, so maybe we can develop the perfect microcode for both :-)

-- 
Frank Buss, fb@frank-buss.de
http://www.frank-buss.de, http://www.it4-systems.de
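To make the stack-based microcode concrete, here is a minimal Python sketch of such an emulator. It is not the Lisp emulator from cpu2.lisp; the function name `run_microcode`, the register dictionary and the exact behaviour of `store-data` are assumptions for illustration, and `call` and the bit-7 return are left out because a program is simply a Python list here. The microcode sequence is the one visible in the trace above for `lda #42`.

```python
# Minimal emulator sketch for a Forth-like microcode (assumed semantics).
MASK16 = 0xFFFF

# Microcode sequence taken from the session trace for "lda #imm".
LDA_IMMEDIATE = ["load-pc", "store-address", "load-pc", "inc",
                 "store-pc", "load-q", "store-accu"]


def run_microcode(program, regs, ram):
    """Run a list of microcode names against a register dict and a RAM."""
    stack = []
    for op in program:
        if op == "load-q":
            # q is the RAM output at the current address register
            stack.append(ram[regs["address"]])
        elif op == "store-data":
            # assumption: writes the low byte to RAM at the address register
            ram[regs["address"]] = stack.pop() & 0xFF
        elif op.startswith("load-"):
            stack.append(regs[op[len("load-"):]])
        elif op.startswith("store-"):
            name = op[len("store-"):]
            width = MASK16 if name in ("pc", "address") else 0xFF
            regs[name] = stack.pop() & width
        elif op == "inc":
            stack.append((stack.pop() + 1) & MASK16)
        elif op == "dec":
            stack.append((stack.pop() - 1) & MASK16)
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append((a + b) & MASK16)
        elif op == "or":
            b, a = stack.pop(), stack.pop()
            stack.append(a | b)
        elif op == "lshift-8":
            stack.append((stack.pop() << 8) & MASK16)
        elif op == "nop":
            pass
        else:
            raise ValueError(f"unknown microcode: {op}")
    return regs


ram = bytearray(0x10000)
ram[0x201] = 0x2A  # operand byte of "lda #42"
# pc already points at the operand, as in the trace after the opcode fetch
regs = {"pc": 0x201, "address": 0, "accu": 0, "x": 0, "y": 0}
run_microcode(LDA_IMMEDIATE, regs, ram)
print(f"a: {regs['accu']:02X}, pc: {regs['pc']:04X}")
```

Keeping each microcode this small is what makes the approach in the post attractive: only the individual microcodes need checking in the VHDL simulation, while whole 6502 instructions can be tested in software.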
Reply by ●May 27, 2007
Frank Buss <fb@frank-buss.de> writes:
> I've implemented a first version of a 6502 core. It has a very simple
> architecture: first the command is read, and then for every command a list
> of microcodes is executed, controlled by a state machine. To avoid the
> redundant VHDL typing, the VHDL code is generated by a Lisp program:
>
[...]
>
> Any ideas how to improve the Lisp code? I like my idea of using a lambda
> function in addressing-commands, because it looks cleaner than the
> macro I tried first, but I don't like the explicit call of
> emit-lines. How can I refactor it to a more DSL-like approach?

(Followup was set to comp.arch.fpga, but since this is Lisp-related, I've changed followup to c.l.l.)

It seems to me that a more natural way to represent this code in a Lisp program would be to use some form of syntax trees. For example, the VHDL statement

if q = x"00" then
  z_flag <= '1';
else
  z_flag <= '0';
end if;

could be represented with this tree:

(if (= q #x00)
    (<= z_flag #\1)
    (<= z_flag #\0))

Producing VHDL code from the tree representation should be straightforward. Peter Seibel's book has a chapter on HTML generation which might be helpful. You might also want to look up "abstract syntax" or "syntax trees" in any compiler textbook.

Sven-Olof Nyström
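The tree-to-VHDL step suggested above can be sketched in a few lines. The following is a Python illustration rather than Lisp, with nested tuples standing in for the s-expression; the function names `emit_expr` and `emit_stmt` are hypothetical, integers are assumed to be 8-bit hex literals, and the one-character strings "0" and "1" play the role of the Lisp characters `#\0` and `#\1`.

```python
# Sketch: render a small syntax tree (nested tuples) as VHDL text.

def emit_expr(node):
    """Render an expression node as a VHDL expression string."""
    if isinstance(node, int):
        return f'x"{node:02X}"'          # assumed 8-bit hex literal
    if isinstance(node, str):
        # "0"/"1" stand for std_logic literals, other strings for signals
        return f"'{node}'" if node in ("0", "1") else node
    op, left, right = node               # e.g. ("=", "q", 0x00)
    return f"{emit_expr(left)} {op} {emit_expr(right)}"


def emit_stmt(node, indent=0):
    """Render a statement node as a list of VHDL source lines."""
    pad = "  " * indent
    kind = node[0]
    if kind == "<=":
        _, target, value = node
        return [f"{pad}{target} <= {emit_expr(value)};"]
    if kind == "if":
        _, cond, then_branch, else_branch = node
        return ([f"{pad}if {emit_expr(cond)} then"]
                + emit_stmt(then_branch, indent + 1)
                + [f"{pad}else"]
                + emit_stmt(else_branch, indent + 1)
                + [f"{pad}end if;"])
    raise ValueError(f"unknown statement: {kind}")


tree = ("if", ("=", "q", 0x00),
        ("<=", "z_flag", "1"),
        ("<=", "z_flag", "0"))
lines = emit_stmt(tree)
print("\n".join(lines))
```

With the program held as data like this, indentation and statement terminators live in one place, so the generator code itself no longer needs explicit line-emitting calls scattered through it.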
Reply by ●May 27, 2007
Ahh nice! I think we have connected brains! :-)

Actually my idea isn't that refined yet. I am taking the long way around the problem. After seeing some of the different retro cores out there (6502, 8051, Z80 and 68K), which I think took just too many resources from my poor small FPGA, I decided that it must be possible to implement something smaller and still have full functionality. The people who have implemented these cores straight in VHDL/Verilog are fantastic people, way beyond my current limit I think, so all cheers to you!

If I/we don't need the 50 MHz (or so) that can be achieved on today's standard hobby FPGA development boards, maybe the FMAX CAN be, in some way, traded for reused blocks.

I started my 68K project by using the fantastic, already super-optimized PicoBlaze by Ken Chapman, which is Xilinx-specific and SMALL:

http://en.wikipedia.org/wiki/PicoBlaze
http://www.xilinx.com/ipcenter/processor_central/picoblaze/picoblaze_user_resources.htm

By shifting over to, for example, the PacoBlaze:

http://bleyer.org/pacoblaze/

I can then make it run on anything and still be quite small! My work is only half finished and I'm not sure it'll fit into the tight program space, but by taking on hard projects at least you learn something, I think.

To fit this into the program space of the PicoBlaze I need to do MAXIMUM reuse of the assembler code, thus crystallizing out the nice reusable parts into assembler subroutines. Maybe even finding reuse in places where it otherwise might be missed. By using an 8-bit CPU to emulate a 16-bit CPU I can save resources but take a hard performance hit.

Doing these tests is very time-consuming and there is lots of stuff left to fix, for example the memory access problem, but I'll keep trying until... well, until I don't feel like it! :-)

Then, when I'm finished (and have something working), I have several possible next steps, all of which I would like to try.

1) Just keep the slow small core, making sure it runs on PicoBlaze (Xilinx hardware) as well as PacoBlaze (anything) and does its job.
2) Try removing all unused instructions from the 8-bit CPU's instruction set, thus making it even smaller BUT ruling out possible future upgrades and removing the compatibility of the internal parts (this would only be published as already compiled BIT files, I think).
3) Try adding extra instructions (by turning the assembler subroutines into new instructions) based on profiling of solution 1) while running. By doing this we can find the perfect balance (or several balances) between size and speed, depending on the demands of the target circuit.
4) A combination of 2) and 3).
5) Maybe building something completely new out of the things learned from all the above.

But your thoughts on doing something generic just sound NICE!

A nice tool that has kept me going this far is:

http://www.mediatronix.com/pBlazeIDE.htm

/Magnus

On May 27, 19:36, Frank Buss <f...@frank-buss.de> wrote:
<snip>
Reply by ●May 27, 2007
Frank Buss wrote:
<snip>
>
> On the FPGA side I only have to simulate the individual microcodes. If every
> microcode does what it should do, then all microcode programs should work
> immediately, because they were tested with my Lisp program before.

Just checking whether you have seen the work of Jan Decaluwe?

http://myhdl.jandecaluwe.com/doku.php/start

>>OK, you won't win prizes in minimum power usage, in readability or
>>probably in anything, but I have a hunch this is the way to achieve
>>MAXIMUM usage of resources.
>
> I think readability is very good (OK, maybe because I know Forth and Lisp),
> and power usage should be good too, because fewer LEs are used. My current
> Forth FPGA implementation needs 319 LEs (about 5% of the small Cyclone
> EP1C6Q240C8). But I expect it to be about 10 times slower than e.g. the T65,
> so overall the work done per unit of energy would not be so good.

If this runs slower, one of my pet ideas for FPGA cores is to design them to run from SerialFLASH memory. Top-end ones (Winbond) run at 150 MBd link speed, so they can feed nearly 20 MB/s of streaming code. Ideally, the core has a short-skip opcode, as a jump in such memory has a higher cost.

-jg
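The "nearly 20 MB/s" figure and the cost of a jump can be checked with a little arithmetic. The sketch below assumes a single-bit serial link (one bit per symbol) and, as a further assumption not taken from the post, that restarting a read at a new address costs one command byte, three address bytes and one dummy byte, as in a typical SPI fast-read sequence; the function names are hypothetical.

```python
# Back-of-envelope numbers for running code from serial flash (assumptions
# noted in comments; not measured values).

def stream_rate_bytes_per_s(baud, bits_per_symbol=1):
    """Sustained byte rate of a serial link while streaming sequentially."""
    return baud * bits_per_symbol / 8


def jump_cost_bytes(command_bits=8, address_bits=24, dummy_bits=8):
    """Streaming time lost on a jump, expressed in bytes not delivered."""
    # assumption: a new read needs command + address + dummy bits first
    return (command_bits + address_bits + dummy_bits) / 8


rate = stream_rate_bytes_per_s(150_000_000)
cost = jump_cost_bytes()
print(f"{rate / 1e6:.2f} MB/s streaming, a jump costs about {cost:.0f} bytes")
```

Under these assumptions a forward branch over only a few bytes is cheaper to stream through than to re-address, which is the motivation for the short-skip opcode mentioned above.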
Reply by ●May 28, 2007
On Mon, 28 May 2007 07:34:24 +1200, Jim Granville <no.spam@designtools.maps.co.nz> wrote:
>Frank Buss wrote:
><snip>
>
>If this runs slower, one of my pet ideas for FPGA cores is to design
>them to run from SerialFLASH memory. Top-end ones (Winbond) run at
>150 MBd link speed, so they can feed nearly 20 MB/s of streaming code.
>Ideally, the core has a short-skip opcode, as a jump in such memory
>has a higher cost.

Or a "four address instruction" like the Pilot ACE, with SerialFlash in place of a tube full of mercury?

- Brian