comp.arch.fpga | RISC-V Support in FPGA | page 6

Reply by Mark Curry ● May 3, 2017

In article <oed1dk$a7b$2@dont-email.me>, rickman  <gnuarm@gmail.com> wrote:
>On 5/3/2017 11:22 AM, Mark Curry wrote:
>> In article <oebfia$o4a$2@dont-email.me>, rickman  <gnuarm@gmail.com> wrote:
>>> On 5/2/2017 5:52 PM, Kevin Neilson wrote:
>>>>> We tested it pretty extensively with Modelsim and Intel FPGA tools; we
>>>>> didn't have enough summer to put it through Xilinx or ASIC tools but happy
>>>>> to fix things if there's any issues.
>>>>>
>>>>> Theo
>>>>
>>>> At first glance I thought I'd seen some object-oriented stuff in there but it was just structs.  I actually used a lot of SystemVerilog a few
>>>> years ago when I was only using Synplify, but now I write cores that have to work in a broad range of synthesizers which sadly don't even accept
>>>> many Verilog-2005 constructs.
>>>
>>> I wonder what is behind that.  Much of VHDL-2008 is supported in most
>>> tools, at least all the good stuff.  I believe the Xilinx tools don't
>>> include 2008, but I haven't tried it.  Otherwise I'm told the third
>>> party vendors support it and the Lattice tools I've used do a nice job
>>> of it.
>>>
>>> I can't understand a vendor being so behind the times.
>>
>> Rick - yeah, it's pathetic.  The synthesizable subset of SystemVerilog was
>> actually fairly concretely defined in the SystemVerilog 3.1 draft, in 2005.
>> We're just now - 12 years later really finding an acceptable solution for
>> FPGA designs.  To repeat myself - It's really pathetic.
>>
>> Vivado seems to actually have BETTER language support for SystemVerilog than
>> Synplify - believe it or not.  But this only works so far until you hit some
>> sort of corner case and the tool spits out a netlist which doesn't match the
>> RTL.  (We've hit too many of those issues in the past 2-3 years).
>>
>> Synplify, on the other hand barfs on perfectly acceptable, synthesizable code
>> (i.e. SystemVerilog features that already have parallels in VHDL).  But
>> Synplify has never (for us) produced a netlist which doesn't match RTL...
>
>Am I hearing a justification for staying with VHDL rather than learning 
>Verilog as I've been intending for some time?  My understanding is that 
>to write test benches like what VHDL can do it is useful to have 
>SystemVerilog.  Or is this idea overblown?

Rick - I was speaking of Synthesizer support within FPGA tools only.

Simulation support depends entirely on your vendor, and is an entirely 
different beast.  We've been happy with Modelsim for all our SystemVerilog
simulations - for many years.  Can't comment much on other simulation 
vendors, and their support.  I've not used VCS, or NCSIM (or whatever 
they're now called) in many years.  Never tried Xilinx "free" simulators,
but for "free" I'd expect you'd get what you pay for.

I'll not wade any deeper into language wars - use what you're most
comfortable with.  Doesn't hurt to have experience with both.

Regards,

Mark

Reply by Kevin Neilson ● May 3, 2017

> Rick - yeah, it's pathetic.  The synthesizable subset of SystemVerilog was 
> actually fairly concretely defined in the SystemVerilog 3.1 draft, in 2005.  
> We're just now - 12 years later really finding an acceptable solution for 
> FPGA designs.  To repeat myself - It's really pathetic.

In my case, I mostly write for Vivado, but I have to write code which will also work for some ASIC synthesis tools which don't like anything too modern.  I'm not sure why; I just know I have to keep to a low common denominator.

Anyway, and this is a different topic altogether, I've reverted to writing very low-level code for Vivado.  I've given up the dream of parameterizable HDL.  I do a lot of Galois Field arithmetic and I put all my parameterization in Matlab and generate Verilog include files (mostly long parameters) from that.  The Verilog then looks about as understandable as assembly and I hate doing it but I have to.  It's the same thing I was doing over ten years ago with Perl but now do with Matlab.  Often Vivado will synthesize the high-level version with functions and nested loops, but it is an order of magnitude slower (synthesis time) than the very low-level version.  And sometimes it doesn't synthesize how I like.  I've just given up on high-level synthesizable code.
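
To make the generate-an-include-file flow concrete, here is a minimal sketch in Python rather than Matlab. It packs a small binary matrix into one sized hex literal and renders it as a Verilog `localparam` line. The matrix contents, the name `TAP_MATRIX`, and the helper names are made up for illustration; the real generator and file layout are not shown in the post.

```python
def pack_matrix(rows):
    """Pack a list of 0/1 rows into one integer, row 0 in the most significant bits."""
    value = 0
    for row in rows:
        for bit in row:
            value = (value << 1) | (bit & 1)
    return value


def verilog_param(name, rows):
    """Render a binary matrix as a sized Verilog localparam line."""
    n_bits = sum(len(row) for row in rows)
    n_hex = (n_bits + 3) // 4  # hex digits needed to hold n_bits
    value = pack_matrix(rows)
    return f"localparam [{n_bits - 1}:0] {name} = {n_bits}'h{value:0{n_hex}x};"


if __name__ == "__main__":
    # Hypothetical 4x4 tap matrix, for illustration only.
    taps = [
        [1, 0, 0, 1],
        [1, 1, 0, 0],
        [0, 1, 1, 0],
        [0, 0, 1, 1],
    ]
    print(verilog_param("TAP_MATRIX", taps))
```

The printed line would be written to a file that the HDL pulls in with `include`, so the synthesizer only ever sees a constant.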

Reply by Mark Curry ● May 3, 2017

In article <e4a7957e-c42e-4846-b8ab-ebcf170238dd@googlegroups.com>,
Kevin Neilson  <kevin.neilson@xilinx.com> wrote:
>> Rick - yeah, it's pathetic.  The synthesizable subset of SystemVerilog was 
>> actually fairly concretely defined in the SystemVerilog 3.1 draft, in 2005.  
>> We're just now - 12 years later really finding an acceptable solution for 
>> FPGA designs.  To repeat myself - It's really pathetic.
>
>In my case, I mostly write for Vivado, but I have to write code which will also work for some ASIC synthesis tools which don't like anything too
>modern.  I'm not sure why; I just know I have to keep to a low common denominator.
>
>Anyway, and this is a different topic altogether, I've reverted to writing very low-level code for Vivado.  I've given up the dream of
>parameterizable HDL.  I do a lot of Galois Field arithmetic and I put all my parameterization in Matlab and generate Verilog include files (mostly
>long parameters) from that.  The Verilog then looks about as understandable as assembly and I hate doing it but I have to.  It's the same thing I
>was doing over ten years ago with Perl but now do with Matlab.  Often Vivado will synthesize the high-level version with functions and nested loops,
>but it is an order of magnitude slower (synthesis time) than the very low-level version.  And sometimes it doesn't synthesize how I like.  I've just
>given up on high-level synthesizable code.

(continuing a bit OT...)

Kevin,

That's unfortunate.  We've been very successful with writing parameterizable code - even 
before SystemVerilog.  Heck, even before Verilog-2001.  Things like N-Tap FIRs, 
Two-D FIRs, FFTs, Video Blenders, etc.  All with configurable settings - 
bit widths, rounding/truncation options, etc.  I think in a previous job I had a 
parameterizable Galois Field Multiplier too.

I'm not sure what trouble you had with the tools.  It takes a bit more up-front work,
but pays off quite a bit in the end.  We really had no choice, given the number of 
FPGAs we do, along with how many engineers support them.  Lots of shared code
was the only way to go. 

If you've got something you like, then I suggest keeping it.  But for others,
I think writing parameterizable HDL isn't too much trouble - and is made
even easier with SystemVerilog.  And higher level too.

Regards,

Mark

Reply by Kevin Neilson ● May 3, 2017

> (continuing a bit OT...)
> 
> Kevin,
> 
> That's unfortunate.  We've been very successful with writing parameterizable code - even 
> before SystemVerilog. Heck even before Verilog-2001.  Things like N-Tap FIRs, 
> Two-D FIRs.  FFTs, Video Blenders, etc...  All with configurable settings - 
> bit widths, rounding/truncation options/etc..  I think in a previous job I had a 
> parametizable Galois Field Multiplier too.
> 
> I'm not sure what trouble you had with the tools.  It takes a bit more up front work,
> but pays off quite a bit in the end.  We really had no choice, given the number of 
> FPGAs we do, along with how many engineers support them.  Lot's of shared code
> was the only way to go. 
> 
> If you've got something you like, then I suggest keeping it.  But for others,
> I think writing parameterizable HDL isn't too much trouble - and is made
> even easier with SystemVerilog.  And higher level too.
> 
> Regards,
> 
> Mark

I've just been burned too many times.  I know better now.  The last time I made the mistake I was just making a simple PN generator (LFSR).  The only complication was that it was highly parallel--I think I had to generate maybe 512 bits per cycle, so it ends up being a big matrix multiplication over GF(2).  First I made the high-level version where you could set parameters for the width and taps and so on.  It took forever for Vivado to crank on it.  This is just a few lines of code, mind you, and is just a bunch of XORs.  Then I had Matlab generate an include file with the matrix packed into a long parameter which essentially sets up XOR taps.  That was, I think, ~20x faster, which translated into hours of synthesis time saved.  The synthesized circuit was also better for various reasons.  This is just one example.  I also still have to instantiate primitives frequently for various reasons.  The level of abstraction doesn't seem like it's changed much in 15 years if you really need performance.  This doesn't really have anything to do with the SystemVerilog constructs.  I'm just talking about high-level code in general.  If I were allowed, I would still use modports, structs, enums, etc.
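
For readers who have not met the "parallel LFSR is a GF(2) matrix" idea, a rough Python sketch follows. Each output bit, and each bit of the state N steps ahead, is the XOR of a fixed subset of the current state bits, so the whole thing reduces to a table of XOR masks. The Fibonacci structure, the 7-bit width, and the tap mask are assumptions chosen to keep the example small; they are not the poster's 512-bit design.

```python
def parity(x):
    """XOR of all bits of x."""
    return bin(x).count("1") & 1


def lfsr_step(state, taps, width):
    """One step of a Fibonacci LFSR; returns (output_bit, next_state)."""
    out = state & 1
    feedback = parity(state & taps)
    return out, (state >> 1) | (feedback << (width - 1))


def parallel_masks(taps, width, n_bits):
    """Build XOR masks for an n_bits-per-cycle version of the LFSR.

    out_masks[k] selects the state bits XORed to form output bit k;
    next_masks[i] does the same for bit i of the state after n_bits steps.
    """
    out_masks = [0] * n_bits
    next_masks = [0] * width
    for j in range(width):  # push each basis state through the serial LFSR
        state = 1 << j
        for k in range(n_bits):
            out, state = lfsr_step(state, taps, width)
            if out:
                out_masks[k] |= 1 << j
        for i in range(width):
            if (state >> i) & 1:
                next_masks[i] |= 1 << j
    return out_masks, next_masks


def parallel_step(state, out_masks, next_masks):
    """Produce len(out_masks) output bits and the next state in one go."""
    outs = [parity(state & m) for m in out_masks]
    nxt = 0
    for i, m in enumerate(next_masks):
        nxt |= parity(state & m) << i
    return outs, nxt
```

The masks depend only on the width, taps, and bits per cycle, which is why they can be computed offline and handed to the synthesizer as a constant.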

Reply by Mark Curry ● May 3, 2017

In article <a66c4c17-6f43-4aec-9dd5-c06badf5b11f@googlegroups.com>,
Kevin Neilson  <kevin.neilson@xilinx.com> wrote:
>> (continuing a bit OT...)
>> 
>> Kevin,
>> 
>> That's unfortunate.  We've been very successful with writing parameterizable code - even 
>> before SystemVerilog. Heck even before Verilog-2001.  Things like N-Tap FIRs, 
>> Two-D FIRs.  FFTs, Video Blenders, etc...  All with configurable settings - 
>> bit widths, rounding/truncation options/etc..  I think in a previous job I had a 
>> parametizable Galois Field Multiplier too.
>> 
>> I'm not sure what trouble you had with the tools.  It takes a bit more up front work,
>> but pays off quite a bit in the end.  We really had no choice, given the number of 
>> FPGAs we do, along with how many engineers support them.  Lot's of shared code
>> was the only way to go. 
>> 
>> If you've got something you like, then I suggest keeping it.  But for others,
>> I think writing parameterizable HDL isn't too much trouble - and is made
>> even easier with SystemVerilog.  And higher level too.
>> 
>> Regards,
>> 
>> Mark
>
>I've just been burned too many times.  I know better now.  The last time I made the mistake I was just making a simple PN generator (LFSR).  The
>only complication was that it was highly parallel--I think I had to generate maybe 512 bits per cycle, so it ends up being a big matrix
>multiplication over GF(2).  First I made the high-level version where you could set a parameters for the width and taps and so on.  It took forever
>for Vivado to crank on it.  This is just a few lines of code, mind you, and is just a bunch of XORs.  Then I had Matlab generate an include file
>with the matrix packed into a long parameter which essentially sets up XOR taps.  That was, I think, ~20x faster, which translated into hours of
>synthesis time.  The synthesized circuit was also better for various reasons.  This is just one example.  I also still have to instantiate
>primitives frequently for various reasons.  The level of abstraction doesn't seem like it's changed much in 15 years if you really need performance.
>This doesn't really have anything to do with the SystemVerilog constructs.  I'm just talking about high-level code in general.  If I were allowed, I
>would still use modports, structs, enums, etc.

Ah, we did find something similar in Vivado.  For us it was a large parallel 
CRC - which is pretty much functionally identical to your LFSR (big XOR trees).

We had code that calculated, basically, a shift table to compute the CRC of a long word.
The RTL code worked fine for ISE.  But when we hit Vivado, it'd pause 10 minutes or so 
over each instance (we had lots), which significantly hit our build times.

So, I changed this code to "almost" self-modifying code.  The code would by default
calculate the shift matrix using our "normal" RTL, which looked something like:
      assign H_n_o = h_pow_n( H_zero, NUM_ZEROS_MINUS_ONE );
where H_zero was a "matrix" of constants, and NUM_ZEROS_MINUS_ONE a static
parameter.  The end result is a matrix of constants as well, but "dynamically"
calculated. (Here "dynamically" means once at elaboration time, since all inputs
to the function were static).

Then we just added code to dump each unknown table entry sort-of like:
  if( ( POLY_WIDTH == 8 ) && ( NUM_ZEROS_MINUS_ONE == 7 ) && ( POLYNOMIAL == 'h2f ) )
    assign H_n_o = 'hd4eaf52e175ffba9;
  ...
  else // no table entry - use default RTL calc
    assign H_n_o = h_pow_n( H_zero, NUM_ZEROS_MINUS_ONE );

We "closed" the loop by hand.  If the "table" entry didn't exist, the tool would use the
RTL definition, and spit out the pre-calculated entry.  All done in 
Verilog.  We'd insert that new table entry into our source code by hand, and continue - the next
time, the build would be quicker.
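
The table-with-fallback pattern can be sketched outside the HDL as well. The Python below is only an illustration: it assumes the shift matrix is a power of a square GF(2) matrix (the internals of h_pow_n are not shown in the post), and the names `PRECOMPUTED` and `shift_matrix` are hypothetical. On a table miss it computes the result the slow way and logs a line that could be pasted back into the table by hand.

```python
def mat_mul_gf2(a, b):
    """Multiply square GF(2) matrices stored as lists of row bitmasks (bit j = column j)."""
    out = []
    for row in a:
        acc = 0
        j = 0
        while row:
            if row & 1:
                acc ^= b[j]  # XOR in row j of b wherever a[i][j] is 1
            row >>= 1
            j += 1
        out.append(acc)
    return out


def mat_pow_gf2(m, n):
    """Raise m to the n-th power (n >= 0) by repeated squaring."""
    result = [1 << i for i in range(len(m))]  # identity matrix
    while n:
        if n & 1:
            result = mat_mul_gf2(result, m)
        m = mat_mul_gf2(m, m)
        n >>= 1
    return result


PRECOMPUTED = {}  # entries are pasted in by hand from earlier runs


def shift_matrix(m, n, log=print):
    """Return m**n over GF(2), using the table when an entry exists."""
    key = (tuple(m), n)
    if key in PRECOMPUTED:
        return list(PRECOMPUTED[key])
    result = mat_pow_gf2(m, n)  # slow path: compute, then report the new entry
    log(f"PRECOMPUTED[{key!r}] = {tuple(result)!r}")
    return result
```

As in the Verilog version, the table is purely an accelerator: deleting every entry changes only how long the calculation takes, not the result.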

This *workaround* was a bit of a kludge, but was the rare (only, really) exception for us
in our parameterized code.  Normally the tools just handled things fine.
And again, to be clear, the only thing we were working around was long synthesis times.  
The quality of results was fine in either case.

Maybe for the code you were creating the pendulum swings the other way,
and it was more the norm, rather than the exception, to see things like this.

Interesting topic; I'm glad to hear of your (and others') experiences.

Regards,

Mark

Reply by Allan Herriman ● May 4, 2017

On Wed, 03 May 2017 13:39:38 -0700, Kevin Neilson wrote:

>> (continuing a bit OT...)
>> 
>> Kevin,
>> 
>> That's unfortunate.  We've been very successful with writing
>> parameterizable code - even before SystemVerilog. Heck even before
>> Verilog-2001.  Things like N-Tap FIRs,
>> Two-D FIRs.  FFTs, Video Blenders, etc...  All with configurable
>> settings -
>> bit widths, rounding/truncation options/etc..  I think in a previous
>> job I had a parametizable Galois Field Multiplier too.
>> 
>> I'm not sure what trouble you had with the tools.  It takes a bit more
>> up front work, but pays off quite a bit in the end.  We really had no
>> choice, given the number of FPGAs we do, along with how many engineers
>> support them.  Lot's of shared code was the only way to go.
>> 
>> If you've got something you like, then I suggest keeping it.  But for
>> others,
>> I think writing parameterizable HDL isn't too much trouble - and is
>> made even easier with SystemVerilog.  And higher level too.
>> 
>> Regards,
>> 
>> Mark
> 
> I've just been burned too many times.  I know better now.  The last time
> I made the mistake I was just making a simple PN generator (LFSR).  The
> only complication was that it was highly parallel--I think I had to
> generate maybe 512 bits per cycle, so it ends up being a big matrix
> multiplication over GF(2).  First I made the high-level version where
> you could set a parameters for the width and taps and so on.  It took
> forever for Vivado to crank on it.  This is just a few lines of code,
> mind you, and is just a bunch of XORs.  Then I had Matlab generate an
> include file with the matrix packed into a long parameter which
> essentially sets up XOR taps.  That was, I think, ~20x faster, which
> translated into hours of synthesis time.  The synthesized circuit was
> also better for various reasons.  This is just one example.  I also
> still have to instantiate primitives frequently for various reasons. 
> The level of abstraction doesn't seem like it's changed much in 15 years
> if you really need performance.  This doesn't really have anything to do
> with the SystemVerilog constructs.  I'm just talking about high-level
> code in general.  If I were allowed, I would still use modports,
> structs, enums, etc.

I use Vivado to do GF multiplications that wide using purely behavioural 
VHDL.  BTW, a straightforward behavioural implementation will *not* give 
good results with a wide bus.
I believe the problem is that most tools (in particular Vivado) do a poor 
job of synthesising xor trees with a massive fanin (e.g. >> 100 bits).  
The optimisers have a poor complexity (I guess at least O(N^2), but it 
might be exponential) wrt the size of the function.

You can use all sorts of mathematical tricks to make it work without needing 
to go "low level".
For example, to deal with large fanin, partition your 512 bit input into 
N slices of 512/N bits each.  Use N multipliers, one for each slice, put 
a keep (or equivalent) attribute on the outputs, then xor the outputs 
together.  This gives the same result, uses about the same number of LUTs, 
but gives the optimiser in the tool a chance to do a good job.
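
The slicing trick relies on XOR being associative: a wide XOR over the whole input equals the XOR of narrower XORs over each slice. A small Python model of that identity is sketched below. The 32-bit width, the random matrix in the usage lines, and the function names are assumptions for illustration, and the model says nothing about how any particular tool treats the keep attribute.

```python
import random


def parity(x):
    """XOR of all bits of x."""
    return bin(x).count("1") & 1


def gf2_matvec(row_masks, x):
    """y[i] = XOR of the bits of x selected by row_masks[i]."""
    y = 0
    for i, mask in enumerate(row_masks):
        y |= parity(mask & x) << i
    return y


def sliced_matvec(row_masks, x, width, n_slices):
    """Same product, computed as n_slices partial products XORed together."""
    step = width // n_slices  # assumes width is a multiple of n_slices
    y = 0
    for s in range(n_slices):
        slice_mask = ((1 << step) - 1) << (s * step)
        partial = gf2_matvec([m & slice_mask for m in row_masks], x & slice_mask)
        y ^= partial  # in hardware, each partial is where the keep would go
    return y


if __name__ == "__main__":
    rng = random.Random(1)
    rows = [rng.getrandbits(32) for _ in range(8)]
    x = rng.getrandbits(32)
    print(sliced_matvec(rows, x, 32, 4) == gf2_matvec(rows, x))
```

Each partial product has a fan-in of only width/n_slices bits, which is the point of the partitioning.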

I use the same GF multiplier code in ISE and Quartus, too (but not on 
buses that wide).

The entire flow is in VHDL and works in any LRM-compliant tool.  It's 
parameterised, too, so I don't need to rewrite for a different bus width.

I've been using similar approaches in VHDL since the turn of the century 
and have never been burned.

YMMV.

Regards,
Allan

Reply by kristoff ● May 4, 2017

Hi all,

As a follow-up in the RISC-V thread.

On 02-05-17 18:11, kristoff wrote:
> Or, you can "mix-match" licenses. Sifive (the company that sells the
> E310 CPU and hifive devboards) are an interesting example of this.
> They open-sourced the RTL design but keep the knowledge of actually
> implementing a risc-v core as optimised as possible for themselfs, as a
> service to sell.

This was on eenews Europe today:
http://www.eenewseurope.com/news/sifive-launches-commercial-risc-v-processor-cores

As a small follow-up question:
Does anybody have any idea how to get the hifive boards in Europe?

For the last thing I ordered in the US (a pandaboard), I had to pay VAT 
(ok, that's normal), but also a handling-fee for the shipping-company 
and the customs-service to get the thing shipped in.
In the end, these additional costs were more than the VAT itself.

Cheerio! Kr. Bonne.

Reply by Kevin Neilson ● May 4, 2017

> We had code that calculated, basically a shift table to calculate the CRC of a long word.
> The RTL code worked fine for ISE.  But when we hit Vivado, it'd pause 10 minutes or so 
> over each instance (we had lots) which significantly hit our build times.
> 
> So, I changed this code to "almost" self-modifying code.  The code would by default
> calculate the shift matrix using our "normal" RTL, which looked something like:
>       assign H_n_o = h_pow_n( H_zero, NUM_ZEROS_MINUS_ONE );
> where H_zero was an "matrix" of constants, and NUM_ZEROS_MINUS_ONE a static
> parameter.  The end result is a matrix of constants as well, but "dynamically"
> calculated. (Here "dynamically" means once at elaboration time, since all inputs
> to the function were static).
> 
> Then we just added code to dump each unknown table entry sort-of like:
>   if( ( POLY_WIDTH == 8 ) && ( NUM_ZEROS_MINUS_ONE == 7 ) && ( POLYNOMIAL == 'h2f ) )
>     assign H_n_o = 'hd4eaf52e175ffba9;
>   ...
>   else // no table entry - use default RTL calc
>     assign H_n_o = h_pow_n( H_zero, NUM_ZEROS_MINUS_ONE );
> 
> We "closed" the loop by hand.  If the "table" entry didn't exist, the tool would use the
> RTL definition, and spit out the pre-calculated entry.  All done in 
> verilog.   We insert that new table entry into our source code by hand, and continue - next
> time the build would be quicker.
> 
> This *workaround* was a bit kludge, but was the rare (only really) exception for us
> in our parameterized code.  Normally the tools just handled things fine.
> And again to be clear the only thing we were working around was long synthesis times.  
> The quality of results was fine in either case.
> 
> Maybe the code you were creating the pendulum swings the other way
> and it was more the norm, rather than the exception to see things like this.
> 
> Interesting topic, I'm glad to hear of your (and others) experiences.
> 
> Regards,
> 
> Mark

I looked up my notes for the LFSR I was referring to and one instance of the more-abstract version took 16 min to synthesize and the less-abstract version took less than a minute.  (And we needed many instances.)  When I try to do something at a higher level it ends up like your experience:  I have to do a lot of experiments to see what works and then tweak things endlessly.  It eats up a lot of time.

Reply by Kevin Neilson ● May 4, 2017

> I use Vivado to do GF multiplications that wide using purely behavioural
> VHDL.  BTW, A straightforward behavioural implementation will *not* give
> good results with a wide bus.
> I believe the problem is that most tools (in particular Vivado) do a poor
> job of synthesising xor trees with a massive fanin (e.g. >> 100 bits).
> The optimisers have a poor complexity (I guess at least O(N^2), but it
> might be exponential) wrt the size of the function.
>
> You can use all sorts of mathematical tricks to make it work without need
> to go "low level".
> For example, to deal with large fanin, partition your 512 bit input into
> N slices of 512/N bits each.  Use N multipliers, one for each slice, put
> a keep (or equivalent) attribute on the outputs, then xor the outputs
> together.  This gives the same result, uses about the same number of LUTs,
> but gives the optimiser in the tool a chance to do a good job.
>
> I use the same GF multiplier code in ISE and Quartus, too (but not on
> buses that wide).
>
> The entire flow is in VHDL and works in any LRM-compliant tool.  It's
> parameterised, too, so I don't need to rewrite for a different bus width.
>
> I've been using similar approaches in VHDL since the turn of the century
> and have never been burned.
>
> YMMV.
>
> Regards,
> Allan

I used to do big GF matrix multiplications in which you could set parameters for the field size and field generator poly, etc.  Vivado just gets bogged down.  Now I just expand that into a GF(2) matrix in Matlab and dump it to a parameter and all Vivado has to know how to do is XOR.
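
The expansion works because multiplying by a fixed element of GF(2^m) is linear over GF(2), so each constant entry of the field matrix becomes an m x m block of ones and zeros. A hedged Python sketch of that single step follows; the field GF(2^8) with polynomial 0x11b (the AES polynomial) and the constant 0x57 in the usage lines are illustrative choices, not the poster's actual field or generator polynomial.

```python
def parity(x):
    """XOR of all bits of x."""
    return bin(x).count("1") & 1


def gf_mul(a, b, poly, m):
    """Multiply a and b in GF(2^m); poly includes the x^m term."""
    result = 0
    for _ in range(m):
        if b & 1:
            result ^= a
        b >>= 1
        a <<= 1
        if a >> m:  # degree reached m: reduce modulo the field polynomial
            a ^= poly
    return result


def const_mul_matrix(c, poly, m):
    """Row masks of the m x m GF(2) matrix for the map x -> c * x."""
    cols = [gf_mul(c, 1 << j, poly, m) for j in range(m)]  # image of each basis element
    rows = []
    for i in range(m):
        mask = 0
        for j, col in enumerate(cols):
            if (col >> i) & 1:
                mask |= 1 << j
        rows.append(mask)
    return rows


def apply_matrix(rows, x):
    """Apply the GF(2) matrix: output bit i is the XOR of x bits selected by rows[i]."""
    y = 0
    for i, mask in enumerate(rows):
        y |= parity(mask & x) << i
    return y


if __name__ == "__main__":
    rows = const_mul_matrix(0x57, 0x11B, 8)
    print(all(apply_matrix(rows, x) == gf_mul(0x57, x, 0x11B, 8) for x in range(256)))
```

Once every entry has been expanded this way, the field size and polynomial no longer appear in the HDL at all; only the XOR masks do.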

I also have problems with the wide XORs.  Multiplication by a big GF(2) matrix means a wide XOR for each column.  Vivado tries to share LUTs with common subexpressions across the columns.  Too much sharing.  That sounds like a good thing, but it's not smart enough to know how much it's impacting timing.  You save LUTs, but you end up with a routing mess and too many levels of logic and you don't come close to meeting timing at all.  So then I have to make a generate loop and put subsections of the matrix in separate modules and use directives to prevent optimizing across boundaries.  (KEEPs don't work.)  It's all a pain.  But then I end up with something a little bigger but which meets timing.

I really wish there were a way to use the carry chains for wide XORs.

Reply by David Brown ● May 4, 2017

On 04/05/17 18:12, kristoff wrote:
> Hi all,
>
>
> As a follow-up in the RISC-V thread.
>
>
> On 02-05-17 18:11, kristoff wrote:
>> Or, you can "mix-match" licenses. Sifive (the company that sells the
>> E310 CPU and hifive devboards) are an interesting example of this.
>> They open-sourced the RTL design but keep the knowledge of actually
>> implementing a risc-v core as optimised as possible for themselfs, as a
>> service to sell.
>
> This was on eenews Europe today:
> http://www.eenewseurope.com/news/sifive-launches-commercial-risc-v-processor-cores
>
> As a small follow-up question:
> Does anybody have any idea how to get the hifive boards in Europe?
>

I got one from the crowdsupply site.  I haven't got round to trying it 
yet :-(

> For the last thing I ordered in the US (a pandaboard), I had to pay VAT
> (ok, that's normal), but also a handling-fee for the shipping-company
> and the customs-service to get the thing shipped in.
> In the end, these additional costs where more then the VAT itself.
>
>
>
> Cheerio! Kr. Bonne.
