On Sunday, March 22, 2020 at 6:43:31 AM UTC-4, Tom Gardner wrote:
> On 22/03/20 01:15, Julian Kemmerer wrote:
> > Hi folks,
> > Here to talk about PipelineC.
>
> With anything like this you have 30s to convince me
> to spend some of my remaining life looking at it rather
> than something else. Hence I want to see:
> - what benefit would it give me, and how
> - what won't it do for me (it isn't a panacea)
> - what do I have to do to use it (scope of work)
> - what don't I have to do if I use it (I'm lazy)
> - how it fits into the well-documented toolchains
> that many people use (since it doesn't do everything)
>
> If I see the negatives, I'm more likely to believe
> the claimed positives.
I'll give it a quick go:
what benefit would it give me, and how:
Feels like RTL when doing clock by clock logic, and can auto pipeline logic otherwise.
what won't it do for me (it isn't a panacea):
Not a full RTL replacement yet. Would love help to get it there.
what do I have to do to use it (scope of work)
Write C-looking code; the tool generates VHDL that can be dropped into any existing project. The main cost is the time to run the tool on top of already long builds.
what don't I have to do if I use it (I'm lazy):
Don't have to manually pipeline all your logic for specific devices / operating frequencies. Can share 'cross-platform' code.
how it fits into the well-documented toolchains:
Outputs VHDL. And C-looking code can be used with gcc for debug/modeling.
Thanks eh!
Reply by Tom Gardner●March 22, 2020
On 22/03/20 01:15, Julian Kemmerer wrote:
> Hi folks,
> Here to talk about PipelineC.
With anything like this you have 30s to convince me
to spend some of my remaining life looking at it rather
than something else. Hence I want to see:
- what benefit would it give me, and how
- what won't it do for me (it isn't a panacea)
- what do I have to do to use it (scope of work)
- what don't I have to do if I use it (I'm lazy)
- how it fits into the well-documented toolchains
that many people use (since it doesn't do everything)
If I see the negatives, I'm more likely to believe
the claimed positives.
Reply by Julian Kemmerer●March 21, 2020
Hi folks,
Here to talk about PipelineC.
https://github.com/JulianKemmerer/PipelineC/wiki
What is it?:
- C-like almost hardware description language
- A compiler that produces VHDL for specific devices/operating frequencies
I am looking for:
- anyone who wants to help me develop (Python, VHDL, C)
- suggestions on how to make PipelineC more useful/new features
- project ideas (heyo open source folks)
In the meantime, I am also here to share my most interesting example so far: Using PipelineC with an AWS F1 instance.
https://github.com/JulianKemmerer/PipelineC/wiki/AWS-F1-DMA-Example
I have made an AMI that you can use to play around with. However, it cannot be made public; I can only share it with specific AWS accounts, so please message me if interested.
I want to share with you why I think PipelineC is particularly powerful:
First, it can mostly replace VHDL/Verilog for describing low level, clock by clock, hardware control logic. Consider the following generic VHDL:
-- Combinatorial logic with a storage register
signal the_reg : some_type_t;
signal the_wire : some_type_t;
process(input, the_reg) is -- inputs sync to clk
    variable input_variable : some_type_t;
    variable the_reg_variable : some_type_t;
begin
    input_variable := input;
    the_reg_variable := the_reg;
    -- ... Do work with 'input_variable', 'the_reg_variable'
    -- and other variables, functions, etc and it kinda looks like C ...
    the_wire <= the_reg_variable;
end process;
the_reg <= the_wire when rising_edge(clk);
output <= the_wire;
The equivalent PipelineC is
some_type_t the_reg;
some_type_t some_func_name(some_type_t input)
{
    // ... Do work with 'input', 'the_reg'
    // ... and other variables, functions, etc...
    // Return == output
    return the_reg;
}
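A minimal sketch of how the register style above behaves when the same source is compiled with gcc, assuming each call to the function stands for one clock cycle and the file-scope variable holds the register value between calls. The accumulate() function and its running-sum behaviour are made up for illustration; they are not taken from the PipelineC repository.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: this file-scope variable plays the role of 'the_reg'.
   It keeps its value from one call (clock cycle) to the next. */
static uint32_t the_reg = 0;

/* Hypothetical example function: a running sum.
   One call models one clock cycle with 'input' sampled on that cycle. */
uint32_t accumulate(uint32_t input)
{
    /* Combinatorial work using the input and the current register value */
    the_reg = the_reg + input;
    /* Return == output, as in the PipelineC snippet above */
    return the_reg;
}

/* Three simulated clock cycles, starting from a cleared register. */
void accumulate_demo(void)
{
    the_reg = 0;
    assert(accumulate(1) == 1);
    assert(accumulate(2) == 3);
    assert(accumulate(4) == 7);
}
```

Calling accumulate_demo() steps the register through 1, 3, 7; a further accumulate(10) would then return 17, since the state carries over between calls.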
Using that functionality I was able to write very RTL-esque serialize+deserialize logic for the AXI4 interface that the AWS F1 shell logic provides to 'customer logic' for DMA. The AXI4 is deserialized to a stream of 4096 byte input data chunks that can be processed by a 'work' function.
I find that most HLS tools have trouble giving the user this sort of low level control, probably under the assumption that it's too low level and not meant for software folks to be concerned with. Most hardware description languages are built for exactly this though.
Second, PipelineC can replace the most basic feature of other HLS tools: auto-pipelining functions.
This AWS example sums 1024 floating point values via an N clock cycle pipelined binary tree of 1023 floating point adders (soft logic, not hard cores yet).
Below is the PipelineC code:
float work(float inputs[1024])
{
    // All the nodes of the tree in arrays so can be written using loops
    // ~log2(N) levels, max of N values in parallel
    float nodes[11][1024]; // Unused elements optimize away

    // Assign inputs to level 0
    uint32_t i;
    for(i=0; i<1024; i=i+1)
    {
        nodes[0][i] = inputs[i];
    }

    // Do the computation starting at level 1
    uint32_t n_adds;
    n_adds = 1024/2;
    uint32_t level;
    for(level=1; level<11; level=level+1)
    {
        // Parallel sums at this level
        for(i=0; i<n_adds; i=i+1)
        {
            nodes[level][i] =
                nodes[level-1][i*2] + nodes[level-1][(i*2)+1];
        }

        // Each level decreases adders in next level by half
        n_adds = n_adds / 2;
    }

    // Return the last node in tree
    return nodes[10][0];
}
(To be clear, I am NOT claiming that this is the best way to sum floats in hardware - it's just a basic example big enough to use most of the FPGA).
The PipelineC tool inserts pipeline registers as needed to meet timing on the particular device technology + operating frequency. I find that most HLS tools are pretty good at this (and will do a lot more than inferring pipelines too) but often require some ugly pragmas that - in a way - can make the code undesirably device specific. Hardware description languages can certainly describe the above hardware. But the code will almost certainly describe a pipeline designed specifically for one device technology/operating frequency - making the code hard for others to reuse even if you are kind enough to share it.
The very capable Virtex Ultrascale+ AWS hardware allows the PipelineC tool to fit the work() function into a pipeline depth/latency of 15 clock cycles (might be able to squeeze into as few as 10 clocks). Running at 125 MHz (an 8 ns cycle time), it is thus capable of summing 1024 floating point values with a latency of 120 nanoseconds.
work() Pipeline:
- Frequency: 125 MHz, new inputs each cycle
- Latency: 15 clocks / 120 ns
LUTs      Registers   CARRY8   CLB
322144    137181      16307    62664
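The latency figure follows directly from the clock frequency and pipeline depth. A small sketch of that arithmetic, with helper names that are my own and not part of PipelineC. The adds-per-second number is a peak figure for the pipeline alone, assuming a new set of 1024 inputs really can be presented every cycle; it says nothing about how fast DMA can supply data.

```c
#include <assert.h>

/* Clock period in nanoseconds for a frequency given in MHz. */
double period_ns(double freq_mhz)
{
    assert(freq_mhz > 0.0);
    return 1000.0 / freq_mhz;
}

/* Latency in nanoseconds of a pipeline that is 'depth_clocks' deep. */
double latency_ns(unsigned depth_clocks, double freq_mhz)
{
    return depth_clocks * period_ns(freq_mhz);
}

/* Peak additions per second when every adder does one add per cycle. */
double adds_per_second(unsigned adders, double freq_mhz)
{
    return adders * freq_mhz * 1e6;
}
```

With the numbers from this post, period_ns(125.0) is 8 ns and latency_ns(15, 125.0) is 120 ns. If the pipeline did fit into 10 clocks, the same formula gives 80 ns.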
Here is the 'main' function / top level for the full hardware implementation:
aws_fpga_dma_outputs_t aws_fpga_dma(aws_fpga_dma_inputs_t i)
{
    // Pull messages out of incoming DMA write data
    dma_msg_s msg_in;
    msg_in = deserializer(i.pcis);

    // Convert incoming DMA message bytes to 'work' inputs
    work_inputs_t work_inputs;
    work_inputs = bytes_to_inputs(msg_in.data);

    // Do some work
    work_outputs_t work_outputs;
    work_outputs = work(work_inputs);

    // Convert 'work' outputs into outgoing DMA message bytes
    dma_msg_s msg_out;
    msg_out.data = outputs_to_bytes(work_outputs);
    msg_out.valid = msg_in.valid;

    // Put output message into outgoing DMA read data when requested
    aws_fpga_dma_outputs_t o;
    o.pcis = serializer(msg_out, i.pcis.arvalid);

    return o;
}
On the software side, utilizing the FPGA hardware with user space file I/O calls looks like:
// Do work() using the FPGA hardware
work_outputs_t work_fpga(work_inputs_t inputs)
{
    // Convert input into bytes
    dma_msg_t write_msg;
    write_msg = inputs_to_bytes(inputs);
    // Write those DMA bytes to the FPGA
    dma_write(write_msg);
    // Read DMA bytes back from the FPGA
    dma_msg_t read_msg;
    read_msg = dma_read();
    // Convert bytes to outputs and return
    work_outputs_t work_outputs;
    work_outputs = bytes_to_outputs(read_msg);
    return work_outputs;
}
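For readers wondering what the byte conversion helpers might look like on the host side, here is a rough sketch. The struct layouts below are assumptions made for illustration (the real definitions live in the repository and may differ): a 4096 byte message holding 1024 packed floats on the way in, and a single float at the start of the message on the way out. It also assumes host and FPGA agree on byte order and on 32 bit IEEE-754 floats.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define DMA_MSG_SIZE 4096 /* bytes per DMA message, per the post */
#define N_INPUTS 1024     /* floats per work() call */

/* Hypothetical layouts, not copied from the PipelineC sources */
typedef struct { uint8_t data[DMA_MSG_SIZE]; } dma_msg_t;
typedef struct { float values[N_INPUTS]; } work_inputs_t;
typedef struct { float sum; } work_outputs_t;

/* Pack the 1024 input floats into one DMA message, byte for byte. */
dma_msg_t inputs_to_bytes(work_inputs_t inputs)
{
    dma_msg_t msg;
    /* 1024 floats * 4 bytes must fill the message exactly */
    assert(sizeof(inputs.values) == DMA_MSG_SIZE);
    memcpy(msg.data, inputs.values, sizeof(inputs.values));
    return msg;
}

/* Assumption: the result float sits in the first 4 bytes of the reply. */
work_outputs_t bytes_to_outputs(dma_msg_t msg)
{
    work_outputs_t outputs;
    memcpy(&outputs.sum, msg.data, sizeof(outputs.sum));
    return outputs;
}
```

memcpy is used rather than pointer casts so the conversion stays well defined regardless of alignment.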
So there you have it: Low level RTL-like control, working right beside highly pipelined logic. All in a familiar C look that could just be compiled with gcc for 'simulation'. Ex. this example uses the same work() function code as hardware description and as the 'golden C model' compiled with gcc to compare against.
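A sketch of what such a gcc-side check could look like, assuming the work() code from earlier in this post is compiled as ordinary C. The reference_sum(), fill_ramp() and matches_reference() helpers are hypothetical names for this illustration. A tolerance is used because the tree adds values in a different order than a simple loop, and float addition is not associative.

```c
#include <assert.h>
#include <stdint.h>

#define N_INPUTS 1024
#define N_LEVELS 11

/* Same binary-tree sum as the work() function above, as plain C. */
float work(float inputs[N_INPUTS])
{
    float nodes[N_LEVELS][N_INPUTS];
    uint32_t i, level, n_adds = N_INPUTS / 2;
    for (i = 0; i < N_INPUTS; i = i + 1)
        nodes[0][i] = inputs[i];
    for (level = 1; level < N_LEVELS; level = level + 1) {
        for (i = 0; i < n_adds; i = i + 1)
            nodes[level][i] = nodes[level-1][i*2] + nodes[level-1][(i*2)+1];
        n_adds = n_adds / 2;
    }
    return nodes[N_LEVELS-1][0];
}

/* Hypothetical reference: straight-line accumulation in double precision. */
double reference_sum(const float inputs[N_INPUTS])
{
    double sum = 0.0;
    for (uint32_t i = 0; i < N_INPUTS; i = i + 1)
        sum += inputs[i];
    return sum;
}

/* Hypothetical test data: inputs[i] = i, so the exact sum is 523776. */
void fill_ramp(float inputs[N_INPUTS])
{
    for (uint32_t i = 0; i < N_INPUTS; i = i + 1)
        inputs[i] = (float)i;
}

/* Returns 1 if work() agrees with the reference within a relative tolerance. */
int matches_reference(float inputs[N_INPUTS], double rel_tol)
{
    assert(rel_tol > 0.0);
    double expected = reference_sum(inputs);
    double diff = (double)work(inputs) - expected;
    double scale = expected < 0.0 ? -expected : expected;
    if (diff < 0.0) diff = -diff;
    if (scale < 1.0) scale = 1.0;
    return diff <= rel_tol * scale;
}
```

The same comparison could be pointed at values read back from the FPGA instead of the local work() call, though the tolerance may need loosening if the soft-logic adders do not round exactly as the CPU does.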
In the sense that C abstracts away the hardware specifics of each CPU architecture + memory model, but only at a very minimal level, I want PipelineC to be the same for digital logic. The same PipelineC code should produce computationally equivalent hardware on any FPGA/ASIC device technology through smarts in the compiler. But C/PipelineC obviously doesn't do everything; there isn't a whole lot of higher level abstraction done for you. It's just the bedrock to build shareable libraries.
Some big features PipelineC lacks at the moment:
- Flow control/combinatorial feed-backward signals through N clock pipelined logic
  - PipelineC can describe FIFOs and BRAMs (hard BRAM IP is the only IP supported right now) to work with data flows, but the equivalent of a bare combinatorial <= assignment operator feedback is missing
- Multiple clock domains / clock crossings (have some neat ideas about this).
  - This would likely be my next big...many month... task?
- The C parser I'm using doesn't let you return constant sized arrays, but PipelineC as a language really should. I think if I modified it (oh gosh help me?) and said 'use g++' to compile this 'C code that returns arrays', it could work out?
Got any ideas on what you'd want to do with PipelineC? Let me know, maybe we can make something cool together. Want support for an open source synthesis tool? I can give Yosys a try.
Thanks for your time folks