I have a general purpose soft processor core that I developed in Verilog. The processor is unusual in that it uses four indexed LIFO stacks with explicit stack pointer controls in the opcode. It is 32 bit, 2 operand, fully pipelined, 8 threads, and produces an aggregate 200 MIPS in bargain basement Altera Cyclone 3 and 4 speed grade 8 parts while consuming ~1800 LEs. The design is relatively simple (as these things go) yet powerful enough to do real work.

I wrote a fairly extensive paper describing the processor, and am about to post it and my code over at opencores.org, but was thinking the paper and the concepts might be good enough for a more formal publication. Any suggestions on who might be interested in publishing it?
New soft processor core paper publisher?
An FPGA engineer seeks advice on where to formally publish a paper describing a custom 32-bit, 8-threaded soft processor that utilizes an unusual architecture of four indexed LIFO stacks. The discussion explores the trade-offs between academic journals, FPGA conferences, and open-source platforms like OpenCores for a design that targets deterministic timing and manual assembly coding rather than a full C/C++ toolchain.
The community recommends providing comparative benchmarks, thorough documentation, and hardware verification to gain credibility, while also suggesting specific venues like IEEE Trans on VLSI, ACM TECS, or more hobbyist-friendly web equivalents.
- A journal paper (IEEE or ACM) is recommended for modern architectural innovations, while FPGA conferences are better for implementation-focused work.
- Successful publication or adoption requires quantitative benchmarks comparing LE usage, BRAM, and instruction cycles against established cores like MicroBlaze or Nios-II.
- The author prioritizes 'correct by construction' design and manual Verilog-based programming over complex toolchains or OS support.
- Hardware verification on actual FPGA silicon is considered essential by the community to validate simulation-based timing and resource claims.
Hi,

I was in a similar position about 5 years ago. My own processor is the ByoRISC, a RISC-like extensible custom processor supporting multiple-input, multiple-output custom instructions.

> I have a general purpose soft processor core that I developed in Verilog. The processor is unusual in that it uses four indexed LIFO stacks with explicit stack pointer controls in the opcode. It is 32 bit, 2 operand, fully pipelined, 8 threads, and produces an aggregate 200 MIPS in bargain basement Altera Cyclone 3 and 4 speed grade 8 parts while consuming ~1800 LEs. The design is relatively simple (as these things go) yet powerful enough to do real work.

This reads like a "fourstack" architecture on steroids. It seems good! How do you compare with more classic RISC-like soft-cores like MicroBlaze, Nios-II, LEON, etc.?

There is also a classic book on stack-based computers; you really need to go through it and reference it in your publication.

> I wrote a fairly extensive paper describing the processor, and am about to post it and my code over at opencores.org, but was thinking the paper and the concepts might be good enough for a more formal publication. Any suggestions on who might be interested in publishing it?

I chose to publish at VLSI-SoC 2008 (due to proximity; that year it was held in Greece). It is an OK conference, although it is not indexed well by DBLP and the like. Anyway, here is a link to my submitted version of the paper:

http://www.nkavvadias.com/publications/kavvadias_vlsisoc08.pdf

The paper was really well received at the conference, and I got some of my best reviews. However, I didn't have the chance to present the paper in person, because I was in really deep-S in the army and couldn't get a three-day special leave for the conference. (I joined the army at 31, so I was something like an elderly private :). For instance, I was about the same age as all the majors in the camp. Only colonels and people among the permanent staff were older.

On the contrary, I had a hard time publishing an extended/long version of the paper as a journal paper. All three publishers argued that, because the conference paper already existed, no journal paper version was necessary (even with ~40% material additions).

My suggestion is to:
a) go for the journal paper (e.g. IEEE Trans. on VLSI or ACM TECS if you have something really modern)
b) otherwise submit to an FPGA or architecture conference.

It depends on where you live; there are numerous European and worldwide conferences with processor-related topics (FPGA-based architectures, GPUs, ASIPs, novel architectures, manycores, etc.). In all cases you may have to adapt your material (e.g. due to page limits) to the conventions of the publisher.

BTW another more recent example is the paper on the iDEA DSP soft-core processor:

http://www.ntu.edu.sg/home/sfahmy/files/papers/fpt2012-cheah.pdf

This looks like a lean, mean architecture well suited to contemporary FPGAs.

Hope this helps.

Best regards,
Nikolaos Kavvadias
http://www.nkavvadias.com
> I have a general purpose soft processor core that I developed in Verilog. The processor is unusual in that it uses four indexed LIFO stacks with explicit stack pointer controls in the opcode. It is 32 bit, 2 operand, fully pipelined, 8 threads, and produces an aggregate 200 MIPS in bargain basement Altera Cyclone 3 and 4 speed grade 8 parts while consuming ~1800 LEs. The design is relatively simple (as these things go) yet powerful enough to do real work.
>
> I wrote a fairly extensive paper describing the processor, and am about to post it and my code over at opencores.org, but was thinking the paper and the concepts might be good enough for a more formal publication. Any suggestions on who might be interested in publishing it?

Do you also have an assembler, C++ compiler and debugger for this beast? You should have a reference design running on an FPGA board if you want to attract a following. Ideally it should also run Linux.

Why can't you do both? Post the code to opencores.org and then write a paper about it and publish.

John
Thank you for your reply Nikolaos!

> This reads like a "fourstack" architecture on steroids. It seems good!

"A Four Stack Processor" by Bernd Paysan? I ran across that paper several years ago (thanks!). Very interesting, but with multiple ALUs, access to data below the LIFO tops, TLBs, security, etc. it is much more complex than my processor. It looks like a real bear to program and manage at the lowest level.

> How do you compare with more classic RISC-like soft-cores like MicroBlaze, Nios-II, LEON, etc.?

The target audience for my processor is an FPGA developer who needs to implement complex functionality that tolerates latency but requires deterministic timing. Hand coding with no toolchain (Verilog initial statement boot code). Simple enough to keep the processor model and current state in one's head (with room to spare). Small enough to fit in the smallest of FPGAs (with room to spare). Not meant at all to run a full-blown OS, but not a trivial processor.

> There is also a classic book on stack-based computers; you really need to go through it and reference it in your publication.

"Stack Computers: The New Wave" by Philip J. Koopman, Jr.? Also ran across that many years ago (thanks!). The main thrust of it seems to be the advocating of single data stack, single return stack, zero operand machines, which I feel (nothing personal) are crap. Easy to design and implement (I've made several while under the spell) but impossible to program in an efficient manner (gobs of real time wasted on stack thrash, the minimization of which leads directly to unreadable procedural coding practices, which leads to catastrophic stack faults).

> On the contrary, I had a hard time publishing an extended/long version of the paper as a journal paper. All three publishers argued that, because the conference paper already existed, no journal paper version was necessary (even with ~40% material additions).

Hmm. The last thing I want is to have my hands tied when I'm trying to give something away for free. But my paper would likely benefit from external editorial input.

> My suggestion is to:
> a) go for the journal paper (e.g. IEEE Trans. on VLSI or ACM TECS if you have something really modern)

My processor incorporates what I believe are a couple of new innovations (but who ever really knows?) that I'd like to get out there if possible. And I wouldn't mind a bit of personal recognition, if only for my efforts.

IEEE is probably out. I fundamentally disagree with the hoarding of technical papers behind a greedy paywall.

> BTW another more recent example is the paper on the iDEA DSP soft-core processor:
> http://www.ntu.edu.sg/home/sfahmy/files/papers/fpt2012-cheah.pdf

Wow, very nice paper describing a very nice design, thanks!
Thanks for the response John!

> Do you also have an assembler, C++ compiler and debugger for this beast? You should have a reference design running on an FPGA board if you want to attract a following. Ideally it should also run Linux.

See my response to Nikolaos above. Full-blown OS support was not the development target. But it's not a PicoBlaze either. Somewhere in the middle, mainly for FPGA algorithms that can benefit from serialization.

> Why can't you do both? Post the code to opencores.org and then write a paper about it and publish.

That's probably the route I'll end up taking.
Eric Wallin <tammie.eric@gmail.com> wrote:
> Thanks for the response John!
>
> > Do you also have an assembler, C++ compiler and debugger for this beast? You should have a reference design running on an FPGA board if you want to attract a following. Ideally it should also run Linux.
>
> See my response to Nikolaos above. Full-blown OS support was not the development target. But it's not a PicoBlaze either. Somewhere in the middle, mainly for FPGA algorithms that can benefit from serialization.

Benchmarks. Tell us why we should use your processor. How does it win compared with the alternatives?

How easy is it to program? An assembler or a C compiler is really necessary to make something usable - LLVM may come in handy as a C compiler toolkit; I'm not sure what an equivalent assembler toolkit would be.

Actually synthesise the thing. It's hard to take seriously something that's never actually been tested for real, especially if it makes assumptions like having gigabytes of single-cycle-latency memory. Debug it and make sure it works in real hardware.

> > Why can't you do both? Post the code to opencores.org and then write a paper about it and publish.

If you put it on opencores, document document document. There are tons of half-baked projects with lame or nonexistent documentation, that kind of half work on the author's dev system but fall over in real life for one reason or another.

Is it vendor-independent, or does it use Xilinx/Altera/etc special stuff? If so, how easily can that be replaced with an alternative vendor?

Regression tests and test suites. How do we know it's working? Can we work on the code and make sure we don't break anything? What does 'working' mean in the first place?

If you're trying to make an argument in computer architecture you can get away without some of this stuff (a research prototype can have rough edges because it's only to prove a point, as long as you tell us what they are). Generally you need to tell a convincing story, and either the story is that XYZ is a useful approach to take (so we can throw away the prototype and build something better) or XYZ is a component people should use (in which case it becomes more convincing if there's more support).

Some lists of well-known conferences:
http://sites.google.com/site/calasweb/fpga-conferences-and-workshops
http://tcfpga.org/conferences.html

Good luck :)

Theo
Thanks for your response Theo!

On Thursday, June 13, 2013 8:44:03 PM UTC-4, Theo Markettos wrote:
> Benchmarks. Tell us why we should use your processor. How does it win compared with the alternatives?

Good point. So far I've coded a verification boot code gauntlet that it has passed, as well as restoring division and log2. If I had more code to push through it I could statistically tailor the instruction set (size the immediates, etc.) but I don't. I may at some point but I may not either. This is mainly for me, to help me to implement various projects that require complex computations in an FPGA (I currently need it for a digital Theremin that is under development), but I want to release it so others may examine and possibly use it or help me make it better, or use some of the ideas in there in their own stuff.

> How easy is it to program? An assembler or a C compiler is really necessary to make something usable - LLVM may come in handy as a C compiler toolkit; I'm not sure what an equivalent assembler toolkit would be.

It's fairly general purpose and I think if you read the paper you might (or might not) find it easy to understand and program by hand using Verilog initial statements. My main goals were that it be simple enough to grasp without tools, complex and fast enough to do real things, have compact opcodes so BRAM isn't wasted, etc. A compiler, OS, etc. are overkill and definitely not the intended target.

There is a middle ground between trivial and full-blown processors (particularly for FPGA logical use). Of all the commercial offerings in this range that I'm aware of, my processor is probably most similar to the Parallax Propeller, which is almost certainly pipeline threaded (though they don't tell you that in the documentation). The Propeller has a video generator; character, sine, and log tables; and other stuff mine doesn't. But mine has a simpler, more unified overall concept and programming model. It is a true middle ground between register and stack machines.

> Actually synthesise the thing. It's hard to take seriously something that's never actually been tested for real, especially if it makes assumptions like having gigabytes of single-cycle-latency memory. Debug it and make sure it works in real hardware.

Not trying to argue from authority, but I've got 10 years of professional HDL experience, and have made several processors in the past for my own edification and had them up and running on Xilinx demo boards. This one hasn't actually run in the flesh yet, but it has gone through the build process many times and has been pretty thoroughly verified, so I would be amazed if there were any issues (famous last words). But I'll run it on a Cyclone IV board before releasing it.

> If you put it on opencores, document document document. There are tons of half-baked projects with lame or nonexistent documentation, that kind of half work on the author's dev system but fall over in real life for one reason or another.

I know what you mean, I never use any code directly from there. To be fair, most of the code I ran across in industry was fairly poor as well. Anyway, I've got a really nice document that took me about a month to write, with lots of drawings, tables, examples, etc. describing the design and my thoughts behind it. Even if people don't particularly like my processor they might be able to get something out of the background info in the paper (FPGA multipliers and RAM, LIFO & ALU design, pipelining, register set construction, etc.).

> Is it vendor-independent, or does it use Xilinx/Altera/etc special stuff? If so, how easily can that be replaced with an alternative vendor?

I was careful to not use vendor specific constructs in the Verilog. The block RAM for main memory and the stacks is inferred, as are the ALU signed multipliers. I spent a long time on the modular partitioning of the code with a strong eye towards verification (as I usually do). The code was developed in Quartus, and has been compiled many, many times, but I haven't run it through XST yet.

> Regression tests and test suites. How do we know it's working? Can we work on the code and make sure we don't break anything? What does 'working' mean in the first place?

I'm probably an odd man out, but I don't agree with a lot of "standard" industry verification methodology. Test benches are fine for really complex code and / or data environments, but there is no substitute for good coding, proper modular partitioning, and thorough hand testing of each module. I've seen too many out of control projects with designers throwing things over various walls, leaving the verification up to the next guy who usually isn't familiar enough with it to really bang on the sensitive parts. And I kind of hate ModelSim.

Anyone that codes should spend a lot of time verifying - I do, and for the most part really enjoy it. The industry has turned this essential activity into something most people loathe, so it just doesn't happen unless people get pushed into doing it. And even then it usually doesn't get done very thoroughly. Co-developing in environments like that is a nightmare.

> Some lists of well-known conferences:
> http://sites.google.com/site/calasweb/fpga-conferences-and-workshops
> http://tcfpga.org/conferences.html

Thanks, I'll check them out!
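
For readers unfamiliar with the restoring division mentioned in this post, the algorithm is the standard shift-and-subtract scheme. The following is a minimal Python sketch of the generic textbook form for unsigned operands; it is not the author's boot code, and the function name and the `bits` parameter are assumptions made for illustration.

```python
def restoring_divide(dividend, divisor, bits=32):
    """Unsigned restoring division; returns (quotient, remainder).

    Generic textbook algorithm, not the processor's actual routine.
    """
    mask = (1 << bits) - 1
    dividend &= mask
    divisor &= mask
    if divisor == 0:
        raise ZeroDivisionError("divisor must be non-zero")

    remainder = 0
    quotient = 0
    # One iteration per bit, most significant bit first.
    for i in range(bits - 1, -1, -1):
        # Shift the next dividend bit into the partial remainder.
        remainder = (remainder << 1) | ((dividend >> i) & 1)
        # Trial subtraction: keep it if the result is non-negative,
        # otherwise "restore" by leaving the remainder unchanged.
        trial = remainder - divisor
        if trial >= 0:
            remainder = trial
            quotient |= 1 << i
    return quotient, remainder
```

The loop runs once per result bit regardless of the operands, which is one reason this style of division fits a processor aimed at deterministic timing.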
Eric Wallin <tammie.eric@gmail.com> wrote:
> Thanks for your response Theo!
>
> On Thursday, June 13, 2013 8:44:03 PM UTC-4, Theo Markettos wrote:
>
> > Benchmarks. Tell us why we should use your processor. How does it win compared with the alternatives?
>
> Good point. So far I've coded a verification boot code gauntlet that it has passed, as well as restoring division and log2. If I had more code to push through it I could statistically tailor the instruction set (size the immediates, etc.) but I don't. I may at some point but I may not either. This is mainly for me, to help me to implement various projects that require complex computations in an FPGA (I currently need it for a digital Theremin that is under development), but I want to release it so others may examine and possibly use it or help me make it better, or use some of the ideas in there in their own stuff.

FWIW 'benchmarks' doesn't necessarily mean running SPECfoo at 2.7 times quicker than a 4004, but things like 'how many instructions does it take to write division/FFT/quicksort/whatever' compared with the leading brand. Or how many LEs, BRAMs, mW, etc. Numbers are good (as is publishing the source so we can reproduce them).

> > How easy is it to program? An assembler or a C compiler is really necessary to make something usable - LLVM may come in handy as a C compiler toolkit; I'm not sure what an equivalent assembler toolkit would be.
>
> It's fairly general purpose and I think if you read the paper you might (or might not) find it easy to understand and program by hand using Verilog initial statements. My main goals were that it be simple enough to grasp without tools, complex and fast enough to do real things, have compact opcodes so BRAM isn't wasted, etc. A compiler, OS, etc. are overkill and definitely not the intended target.

Fair enough. If you're making architectural points, you can probably get away with assembly examples. A simple assembler is good for developer sanity, though. Could probably be knocked up in Python reasonably fast.

> > Actually synthesise the thing. It's hard to take seriously something that's never actually been tested for real, especially if it makes assumptions like having gigabytes of single-cycle-latency memory. Debug it and make sure it works in real hardware.
>
> Not trying to argue from authority, but I've got 10 years of professional HDL experience, and have made several processors in the past for my own edification and had them up and running on Xilinx demo boards. This one hasn't actually run in the flesh yet, but it has gone through the build process many times and has been pretty thoroughly verified, so I would be amazed if there were any issues (famous last words). But I'll run it on a Cyclone IV board before releasing it.

I'm just a bit jaded from seeing papers at conferences where somebody wrote some Verilog which they only ran in ModelSim, and never had to worry about limited BRAM, or meeting timing, or multiple clock domains, or...

> > If you put it on opencores, document document document. There are tons of half-baked projects with lame or nonexistent documentation, that kind of half work on the author's dev system but fall over in real life for one reason or another.
>
> I know what you mean, I never use any code directly from there. To be fair, most of the code I ran across in industry was fairly poor as well. Anyway, I've got a really nice document that took me about a month to write, with lots of drawings, tables, examples, etc. describing the design and my thoughts behind it. Even if people don't particularly like my processor they might be able to get something out of the background info in the paper (FPGA multipliers and RAM, LIFO & ALU design, pipelining, register set construction, etc.).

This is good. Just a thought - could you angle it as 'how to do processor design' using your processor as a case study? That makes it more of a useful tutorial than 'buy our brand, it's great'...

> I'm probably an odd man out, but I don't agree with a lot of "standard" industry verification methodology. Test benches are fine for really complex code and / or data environments, but there is no substitute for good coding, proper modular partitioning, and thorough hand testing of each module. I've seen too many out of control projects with designers throwing things over various walls, leaving the verification up to the next guy who usually isn't familiar enough with it to really bang on the sensitive parts. And I kind of hate ModelSim.

That's not exactly what I meant... let's say you rearrange the pipelining on your CPU. It turns out you introduce some obscure bug that causes branches to jump to the wrong place if there's a multiply 3 instructions back from the branch. How would you know if you did this, and make sure it didn't happen again? Hand testing modules won't catch that.

It's worse if there's an OS involved, of course. But it can be easy to introduce stupid bugs when you're refactoring something, and waste a lot of time tracking them down.

We use Bluespec so avoid ModelSim ;-) (with Jenkins so we run the test suite for every commit. A bit overkill for your needs, perhaps)

> Anyone that codes should spend a lot of time verifying - I do, and for the most part really enjoy it. The industry has turned this essential activity into something most people loathe, so it just doesn't happen unless people get pushed into doing it. And even then it usually doesn't get done very thoroughly. Co-developing in environments like that is a nightmare.

I admit the tools don't always make it easy...

Theo
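
To give a rough idea of the "simple assembler knocked up in Python" suggested in this post, here is a toy two-pass sketch. Everything about the instruction set in it is hypothetical: the mnemonics, the opcode values, and the 16-bit word layout (4-bit opcode, 12-bit operand) are invented for illustration and have nothing to do with the real processor. The output helper emits memory assignments in the spirit of the Verilog initial statement boot code discussed in the thread, with `mem` as an assumed array name.

```python
# Hypothetical mnemonics and encodings, invented for illustration only.
OPCODES = {"nop": 0x0, "lit": 0x1, "add": 0x2, "jmp": 0x3}
TAKES_OPERAND = {"lit", "jmp"}


def assemble(source):
    """Two-pass assembly into 16-bit words (4-bit opcode, 12-bit operand)."""
    # Pass 1: strip comments and blank lines, record label addresses.
    labels, program = {}, []
    for raw in source.splitlines():
        line = raw.split(";", 1)[0].strip()
        if not line:
            continue
        if line.endswith(":"):
            labels[line[:-1]] = len(program)
            continue
        program.append(line.split())

    # Pass 2: encode each instruction, resolving labels to addresses.
    words = []
    for fields in program:
        mnemonic, args = fields[0].lower(), fields[1:]
        if mnemonic not in OPCODES:
            raise ValueError(f"unknown mnemonic: {mnemonic}")
        operand = 0
        if mnemonic in TAKES_OPERAND:
            if len(args) != 1:
                raise ValueError(f"{mnemonic} takes exactly one operand")
            arg = args[0]
            operand = labels[arg] if arg in labels else int(arg, 0)
        elif args:
            raise ValueError(f"{mnemonic} takes no operand")
        if not 0 <= operand < 0x1000:
            raise ValueError(f"operand out of range: {operand}")
        words.append((OPCODES[mnemonic] << 12) | operand)
    return words


def to_verilog_init(words, mem="mem"):
    """Format words as assignments for pasting inside an initial block."""
    return "\n".join(
        f"{mem}[{addr}] = 16'h{word:04x};" for addr, word in enumerate(words)
    )
```

For example, `to_verilog_init(assemble("start:\n lit 5\n add\n jmp start"))` yields three assignment lines that could be included in an initial block, which keeps the "no toolchain" workflow while removing hand encoding.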
On Friday, June 14, 2013 5:29:15 PM UTC-4, Theo Markettos wrote:
> FWIW 'benchmarks' doesn't necessarily mean running SPECfoo at 2.7 times quicker than a 4004, but things like 'how many instructions does it take to write division/FFT/quicksort/whatever' compared with the leading brand. Or how many LEs, BRAMs, mW, etc. Numbers are good (as is publishing the source so we can reproduce them).

I have FPGA resource numbers for the Cyclone III target in the paper. Briefly, it consumes ~1800 LEs, 4 18x18 multipliers, 4 BRAMs for the stacks, plus whatever the main memory needs. This is roughly 1/3 of the smallest Cyclone III part. I have a restoring division example in the paper that gives 197 / 293 cycles best / worst case (a thread cycle is 8 200 MHz clocks, but there are 8 threads running at this speed so aggregate throughput is potentially 200 MIPS if all threads are busy doing something).

I've seen lots of papers that claim speed numbers but don't give the speed grade, or tell you what hoops they jumped through to get those speeds. Without that info the speeds are meaningless.

> Fair enough. If you're making architectural points, you can probably get away with assembly examples. A simple assembler is good for developer sanity, though. Could probably be knocked up in Python reasonably fast.

That's certainly possible. At this point I'm writing code for it directly in Verilog using an initial statement text file that gets included in the main memory. Several define statements make this clearer and actually fairly easy. But uploading code to a boot loader would require something like an assembler. I'm really trying to stay away from the need for toolsets.

> This is good. Just a thought - could you angle it as 'how to do processor design' using your processor as a case study? That makes it more of a useful tutorial than 'buy our brand, it's great'...

The paper is kind of that, background and general how to, but my processor doesn't have caches, branch prediction, pipeline hazards, TLBs, etc. so people wanting to know how to do that stuff will come up totally empty.

> That's not exactly what I meant... let's say you rearrange the pipelining on your CPU. It turns out you introduce some obscure bug that causes branches to jump to the wrong place if there's a multiply 3 instructions back from the branch. How would you know if you did this, and make sure it didn't happen again? Hand testing modules won't catch that.

It's correct by construction! ;-) Seriously though, there are no hazards to speak of and very little internal state, so branches pretty much either work or they don't. Once basic functionality was confirmed in simulation, I used processor code to check the processor itself, e.g. I wrote some code that checks all branches against all possible branch conditions. Each test increments a count if it passes or decrements if it fails. The final passing number can only be reached if all tests pass. I've got simple code like this to test all of the opcodes. This exercise can help give an early feel for the completeness of the instruction set as well. Verifying something like the Pentium must be one agonizingly painful mountain to climb. Verifying each silicon copy must be a bear as well.
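
The pass/fail counting scheme described in this post can be modeled in a few lines of Python. This is only a sketch of the idea: the condition names, the `branch_taken` stand-in, and the case table are hypothetical and are not taken from the actual boot code. Each passing check adds one and each failing check subtracts one, so the count equals the number of checks only when every check passes.

```python
import operator

# Hypothetical branch conditions; the real processor's set may differ.
CONDITIONS = {
    "eq": operator.eq,
    "ne": operator.ne,
    "lt": operator.lt,
    "ge": operator.ge,
}


def branch_taken(cond, a, b):
    """Stand-in for the branch decision of the design under test."""
    return CONDITIONS[cond](a, b)


def run_gauntlet(cases):
    """Return the pass/fail count: +1 per passing case, -1 per failing one."""
    count = 0
    for cond, a, b, expected in cases:
        count += 1 if branch_taken(cond, a, b) == expected else -1
    return count


# Each condition is checked in both the taken and the not-taken direction.
CASES = [
    ("eq", 3, 3, True), ("eq", 3, 4, False),
    ("ne", 3, 4, True), ("ne", 3, 3, False),
    ("lt", -1, 0, True), ("lt", 0, -1, False),
    ("ge", 0, 0, True), ("ge", -1, 0, False),
]
```

Comparing `run_gauntlet(CASES)` against `len(CASES)` gives a single go/no-go value, and because a failure moves the count by two relative to a pass, no mix of passes and failures can reach the passing total by accident.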





