On Sunday, March 22, 2020 at 6:43:31 AM UTC-4, Tom Gardner wrote:
> On 22/03/20 01:15, Julian Kemmerer wrote:
> > Hi folks,
> > Here to talk about PipelineC.
> 
> With anything like this you have 30s to convince me
> to spend some of my remaining life looking at it rather
> than something else. Hence I want to see:
>   - what benefit would it give me, and how
>   - what won't it do for me (it isn't a panacea)
>   - what do I have to do to use it (scope of work)
>   - what don't I have to do if I use it (I'm lazy)
>   - how it fits into the well-documented toolchains
>     that many people use (since it doesn't do everything)
> 
> If I see the negatives, I'm more likely to believe
> the claimed positives.

I'll give it a quick go:

what benefit would it give me, and how:
Feels like RTL when doing clock by clock logic, and can auto pipeline logic otherwise.

what won't it do for me (it isn't a panacea):
Not a full RTL replacement yet. Would love help to get it there.

what do I have to do to use it (scope of work)
Write C-looking code; the tool generates VHDL that can be dropped into any existing project. Mostly a matter of time to run the tool in addition to already long builds.

what don't I have to do if I use it (I'm lazy):
Don't have to manually pipeline all your logic to specific devices / operating frequencies. Can share 'cross-platform' code.

how it fits into the well-documented toolchains:
Outputs VHDL. And C-looking code can be used with gcc for debug/modeling.

Thanks eh!

Hi folks,
Here to talk about PipelineC.

https://github.com/JulianKemmerer/PipelineC/wiki

What is it?:
- C-like almost hardware description language
- A compiler that produces VHDL for specific devices/operating frequencies

I am looking for:
- anyone who wants to help me develop (Python, VHDL, C)
- suggestions on how to make PipelineC more useful/new features
- project ideas (heyo open source folks)

In the meantime, I am also here to share my most interesting example so far: Using PipelineC with an AWS F1 instance.

https://github.com/JulianKemmerer/PipelineC/wiki/AWS-F1-DMA-Example

I have made an AMI that you can use to play around with. However, it cannot be made public; I can only share it with specific AWS accounts, please message me if interested.

I want to share with you why I think PipelineC is particularly powerful:

First, it can mostly replace VHDL/Verilog for describing low level, clock by clock, hardware control logic. Consider the following generic VHDL:

-- Combinatorial logic with a storage register
signal the_reg : some_type_t;
signal the_wire : some_type_t;
process(input, the_reg) is -- inputs sync to clk
    variable input_variable: some_type_t;
    variable the_reg_variable : some_type_t;
begin
    input_variable := input;
    the_reg_variable := the_reg;

    ... Do work with 'input_variable', 'the_reg_variable'
    and other variables, functions, etc and it kinda looks like C ...

    the_wire <= the_reg_variable;
end process;
the_reg <= the_wire when rising_edge(clk);
output <= the_wire;


The equivalent PipelineC is

some_type_t the_reg;
some_type_t some_func_name(some_type_t input)
{
    ... Do work with 'input', 'the_reg'
    ... and other variables, functions, etc...

    // Return==output
    return the_reg;
}
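
To make the register behaviour concrete, here is a minimal sketch of how a function in this style behaves when compiled as plain C with gcc: each call models one clock cycle, and the global keeps its value between calls the way a register would. The accumulator itself and the names 'acc_reg' and 'accumulate' are made up for illustration and are not taken from the PipelineC repo.

```c
#include <stdint.h>

// Hypothetical example: a running-sum accumulator.
// The global models a register; it holds its value between calls.
uint32_t acc_reg;

// Each call models one clock cycle.
uint32_t accumulate(uint32_t input)
{
    // Combinatorial work using 'input' and the current register value
    acc_reg = acc_reg + input;

    // Return==output
    return acc_reg;
}
```

Calling accumulate(1), accumulate(2), accumulate(3) in sequence returns 1, 3, 6, which is the same cycle-by-cycle behaviour you would expect from an adder feeding a register.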

Using that functionality I was able to write very RTL-esque serialize+deserialize logic for the AXI4 interface that the AWS F1 shell logic provides to 'customer logic' for DMA. The AXI4 is deserialized to a stream of 4096 byte input data chunks that can be processed by a 'work' function.

I find that most HLS tools have trouble giving the user this sort of low level control, probably under the assumption that it's too low level and not meant for software folks to be concerned with. Most hardware description languages are built for exactly this though.

Second, PipelineC can replace the most basic feature of other HLS tools: auto-pipelining functions.

This AWS example sums 1024 floating point values via an N clock cycle pipelined binary tree of 1023 floating point adders (soft logic, not hard cores yet).

Below is the PipelineC code:

float work(float inputs[1024])
{
	// All the nodes of the tree in arrays so can be written using loops
	// ~log2(N) levels, max of N values in parallel
	float nodes[11][1024]; // Unused elements optimize away

	// Assign inputs to level 0
	uint32_t i;
	for(i=0; i<1024; i=i+1)
	{
		nodes[0][i] = inputs[i];
	}

	// Do the computation starting at level 1
	uint32_t n_adds;
	n_adds = 1024/2;
	uint32_t level;
	for(level=1; level<11; level=level+1)
	{
		// Parallel sums at this level
		for(i=0; i<n_adds; i=i+1)
		{
			nodes[level][i] =
				nodes[level-1][i*2] + nodes[level-1][(i*2)+1];
		}

		// Each level decreases adders in next level by half
		n_adds = n_adds / 2;
	}

	// Return the last node in tree
	return nodes[10][0];
}

(To be clear, I am NOT claiming that this is the best way to sum floats in hardware - it's just a basic example big enough to use most of the FPGA).

The PipelineC tool inserts pipeline registers as needed to meet timing on the particular device technology + operating frequency. I find that most HLS tools are pretty good at this (and will do a lot more than inferring pipelines too) but often require some ugly pragmas that - in a way - can make the code undesirably device specific. Hardware description languages can certainly describe the above hardware. But the code will almost certainly describe a pipeline designed specifically for one device technology/operating frequency - making the code hard for others to reuse even if you are kind enough to share it.

The very capable Virtex Ultrascale+ AWS hardware allows the PipelineC tool to fit the work() function into a pipeline depth/latency of 15 clock cycles (might be able to squeeze into as few as 10 clocks). Running at 125MHz (an 8 ns cycle time), it is thus capable of summing 1024 floating point values in 120 nanoseconds.

work() Pipeline:
- Frequency: 125 MHz, new inputs each cycle
- Latency: 15 clocks / 120 ns
LUTS   Registers CARRY8 CLB
322144 137181    16307  62664
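
The latency figure is just pipeline depth times clock period. A small sketch of that arithmetic in plain C, in case you want to plug in other depths or frequencies (the function names here are my own, not part of the tool):

```c
#include <stdint.h>

// Clock period in nanoseconds for a frequency given in MHz
double clock_period_ns(double freq_mhz)
{
    return 1000.0 / freq_mhz;
}

// Pipeline latency in nanoseconds: depth (in clocks) times the clock period
double pipeline_latency_ns(uint32_t depth_clocks, double freq_mhz)
{
    return depth_clocks * clock_period_ns(freq_mhz);
}
```

pipeline_latency_ns(15, 125.0) gives the 120 ns above; if the tool did manage to squeeze work() into 10 clocks at the same frequency, the same formula gives 80 ns.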

Here is the 'main' function / top level for the full hardware implementation:

aws_fpga_dma_outputs_t aws_fpga_dma(aws_fpga_dma_inputs_t i)
{
  // Pull messages out of incoming DMA write data
  dma_msg_s msg_in;
  msg_in = deserializer(i.pcis);

  // Convert incoming DMA message bytes to 'work' inputs
  work_inputs_t work_inputs;
  work_inputs = bytes_to_inputs(msg_in.data);

  // Do some work
  work_outputs_t work_outputs;
  work_outputs = work(work_inputs);

  // Convert 'work' outputs into outgoing DMA message bytes
  dma_msg_s msg_out;
  msg_out.data = outputs_to_bytes(work_outputs);
  msg_out.valid = msg_in.valid;

  // Put output message into outgoing DMA read data when requested
  aws_fpga_dma_outputs_t o;
  o.pcis = serializer(msg_out, i.pcis.arvalid);

  return o;
}

On the software side, utilizing the FPGA hardware with user space file I/O calls looks like:

// Do work() using the FPGA hardware
work_outputs_t work_fpga(work_inputs_t inputs)
{
	// Convert input into bytes
	dma_msg_t write_msg;
	write_msg = inputs_to_bytes(inputs);
	// Write those DMA bytes to the FPGA
	dma_write(write_msg);
	// Read DMA bytes back from the FPGA
	dma_msg_t read_msg;
	read_msg = dma_read();
	// Convert bytes to outputs and return
	work_outputs_t work_outputs;
	work_outputs = bytes_to_outputs(read_msg);
	return work_outputs;
}

So there you have it: Low level RTL-like control, working right beside highly pipelined logic. All in a familiar C look that could just be compiled with gcc for 'simulation'. For example, this example uses the same work() function code as hardware description and as the 'golden C model' compiled with gcc to compare against.
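
As a rough sketch of what a host-side sanity check of such a gcc-compiled model could look like: the tree sum below has the same structure as work(), while 'sequential_sum' and 'sums_match' are hypothetical helpers I am adding here for illustration (they are not in the repo). Float addition is not associative, so the tree ordering and a plain left-to-right sum can differ slightly, hence the tolerance.

```c
#include <stdint.h>

#define N_INPUTS 1024
#define N_LEVELS 11 // levels 0..10, since log2(1024) = 10

// Same binary tree structure as work(), compiled as ordinary C
float tree_sum(const float inputs[N_INPUTS])
{
	float nodes[N_LEVELS][N_INPUTS];
	uint32_t i, level, n_adds;

	// Assign inputs to level 0
	for(i=0; i<N_INPUTS; i=i+1)
	{
		nodes[0][i] = inputs[i];
	}

	// Each level has half the adders of the previous one
	n_adds = N_INPUTS/2;
	for(level=1; level<N_LEVELS; level=level+1)
	{
		for(i=0; i<n_adds; i=i+1)
		{
			nodes[level][i] =
				nodes[level-1][i*2] + nodes[level-1][(i*2)+1];
		}
		n_adds = n_adds / 2;
	}
	return nodes[N_LEVELS-1][0];
}

// Hypothetical reference: plain left-to-right sum
float sequential_sum(const float inputs[N_INPUTS])
{
	float sum = 0.0f;
	uint32_t i;
	for(i=0; i<N_INPUTS; i=i+1)
	{
		sum = sum + inputs[i];
	}
	return sum;
}

// Hypothetical helper: 1 if a and b agree within a relative tolerance
int sums_match(float a, float b, float rel_tol)
{
	float diff = (a > b) ? (a - b) : (b - a);
	float abs_a = (a < 0.0f) ? -a : a;
	float abs_b = (b < 0.0f) ? -b : b;
	float scale = (abs_a > abs_b) ? abs_a : abs_b;
	if(scale < 1.0f) scale = 1.0f;
	return diff <= rel_tol * scale;
}
```

The same kind of tolerance check could be applied to the value read back from the FPGA versus the gcc-compiled work(), though how tight the tolerance can be depends on how closely the soft float adders match the host's rounding, which I am not making any claims about here.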

In the sense that C abstracts away the hardware specifics of each CPU architecture + memory model, but only at a very minimal level, I want PipelineC to be the same for digital logic. The same PipelineC code should produce computationally equivalent hardware on any FPGA/ASIC device technology through smarts in the compiler. But C/PipelineC obviously doesn't do everything; there isn't a whole lot of higher level abstraction done for you. It's just the bedrock to build shareable libraries.

Some big features PipelineC lacks at the moment:
- Flow control/combinatorial feed-backward signals through N clock pipelined logic
 - PipelineC can describe FIFOs, BRAMs (hard BRAM IP is the only IP supported right now) to work with data flows, but the equivalent of a bare combinatorial <= assignment operator feedback is missing
- Multiple clock domains / clock crossings (have some neat ideas about this).
 - This would likely be my next big...many month... task?
- The C parser I'm using doesn't let you return constant sized arrays, but PipelineC as a language really should. I think if I modified it (oh gosh help me?) and said 'use g++' to compile this 'C code that returns arrays', it could work out?

Got any ideas on what you'd want to do with PipelineC? Let me know; maybe we can make something cool together. Want support for an open source synthesis tool? I can give Yosys a try.

Thanks for your time folks