FPGARelated.com

PipelineC (again), dct example, looking for help/interest

Started by Julian Kemmerer September 7, 2019
Hi folks, I'm looking for feedback on PipelineC and ideas for what to implement next. 

I will point you to a recent Reddit post, which in turn links to the code on GitHub. 

https://www.reddit.com/r/FPGA/comments/d0x2p5/serial_8x8_dct_in_pipelinec_lower_resource_usage/ 

Here is the code to get you interested: 

// This is the unrolled version of the original dct copy-and-pasted algorithm 
// https://www.geeksforgeeks.org/discrete-cosine-transform-algorithm-program/ 
// PipelineC iterations of dctTransformUnrolled are used 
// to unroll the calculation serially in O(n^4) time 

// Input 'matrix' and start=1 to begin calculation 
// Input 'matrix' must stay constant until return .done 

// 'sum' accumulates over iterations/clocks and should be pipelined 
// So 'sum' must be a volatile global variable 
// Keep track of when sum is valid and can be read+written 
volatile uint1_t dct_volatiles_valid; 
// sum will temporarily store the sum of cosine signals 
volatile float dct_sum; 
// dct_result will store the discrete cosine transform 
// Signal that this is the iteration containing the 'done' result 
typedef struct dct_done_t 
{ 
        float matrix[DCT_M][DCT_N]; 
        uint1_t done; 
} dct_done_t; 
volatile dct_done_t dct_result; 
dct_done_t dctTransformUnrolled(dct_pixel_t matrix[DCT_M][DCT_N], uint1_t start) 
{ 
        // Assume not done yet 
        dct_result.done = 0; 
         
        // Start validates volatiles 
        if(start) 
        { 
                dct_volatiles_valid = 1; 
        } 
         
        // Global func to handle getting to BRAM 
        //     1) Lookup constants from BRAM (using iterators) 
        //     2) Increment iterators 
        // Returns next iterators and constants and will increment when requested 
        dct_lookup_increment_t lookup_increment; 
        uint1_t do_increment; 
        // Only increment when volatiles valid 
        do_increment = dct_volatiles_valid; 
        lookup_increment = dct_lookup_increment(do_increment); 
         
        // Unpack struct for ease of reading calculation code below 
        float const_val; 
        const_val = lookup_increment.lookup.const_val; 
        float cos_val; 
        cos_val = lookup_increment.lookup.cos_val; 
        dct_iter_t i; 
        i = lookup_increment.incrementer.curr_iters.i; 
        dct_iter_t j; 
        j = lookup_increment.incrementer.curr_iters.j; 
        dct_iter_t k; 
        k = lookup_increment.incrementer.curr_iters.k; 
        dct_iter_t l; 
        l = lookup_increment.incrementer.curr_iters.l; 
        uint1_t reset_k; 
        reset_k = lookup_increment.incrementer.increment.reset_k; 
        uint1_t reset_l; 
        reset_l = lookup_increment.incrementer.increment.reset_l; 
        uint1_t done; 
        done = lookup_increment.incrementer.increment.done; 
         
         
        // Do math for this volatile iteration only when 
        // can safely read+write volatiles 
        if(dct_volatiles_valid) 
        { 
                // ~~~ The primary calculation ~~~: 
                // 1) Float * cosine constant from lookup table   
                float dct1; 
                dct1 = (float)matrix[k][l] * cos_val; 
                // 2) Increment sum 
                dct_sum = dct_sum + dct1; 
                // 3) constant * Float and assign into the output matrix 
                dct_result.matrix[i][j] = const_val * dct_sum; 
                 
                // Sum accumulates during the k and l loops 
                // So reset when they are rolling over 
                if(reset_k & reset_l) 
                { 
                        dct_sum = 0.0; 
                } 
                 
                // Done yet? 
                dct_result.done = done; 
                 
                // Reset volatiles once done 
                if(done) 
                { 
                        dct_volatiles_valid = 0; 
                } 
        } 
         
        return dct_result; 
} 
What does this synthesize to? 

Essentially a state machine in which every state reuses the same N clocks' worth of logic (the body of dctTransformUnrolled) to do its work. 

Consider the 'execution' of the function in time order. The logic consists of: 

~17% of time for getting lookup constants and incrementing the iterators (dct_lookup_increment), plus reading the [k][l] value out of input 'matrix' 

~21% of time for the 1) Float * cosine constant from lookup table, a floating point multiplier 

~34% of time for the 2) Increment sum addition, a floating point adder 

~21% of time for the 3) constant * Float, a floating point multiplier 

~5% of time for the 3) assignment into the output matrix at [i][j] 

That pipeline takes some fixed number of clock cycles N. That means 'dct_volatiles_valid' will equal 1 once every N clock cycles (after being set at the start). For the 8x8 case the algorithm unrolls as O(n^4) into 8^4 = 4096 total iterations, so the total latency is N * 4096 clock cycles.
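
The iteration count and latency can be sanity-checked with a small software model of the four iterators. This is only a sketch: the real dct_lookup_increment is not shown in this post, so the rollover order (l fastest, then k, j, i), the 8x8 dimensions, and the names iters_t, walk_stats_t, walk_iterators and total_latency_cycles are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define DCT_M 8
#define DCT_N 8

/* Hypothetical software model of the four loop iterators */
typedef struct {
    uint8_t i, j, k, l;
} iters_t;

typedef struct {
    uint32_t iterations; /* total passes through the function body */
    uint32_t sum_resets; /* passes where k and l roll over together */
} walk_stats_t;

/* Step through every (i, j, k, l) combination once, l fastest */
walk_stats_t walk_iterators(void)
{
    iters_t it = {0, 0, 0, 0};
    walk_stats_t stats = {0, 0};
    int done = 0;
    while (!done) {
        stats.iterations++;
        /* An iterator "resets" when it sits at its last value */
        int reset_l = (it.l == DCT_N - 1);
        int reset_k = (it.k == DCT_M - 1);
        int reset_j = (it.j == DCT_N - 1);
        int reset_i = (it.i == DCT_M - 1);
        /* The accumulator is cleared when k and l both roll over */
        if (reset_k && reset_l) stats.sum_resets++;
        done = reset_i && reset_j && reset_k && reset_l;
        /* Advance like nested loops: carry into the next iterator */
        if (!reset_l) { it.l++; continue; }
        it.l = 0;
        if (!reset_k) { it.k++; continue; }
        it.k = 0;
        if (!reset_j) { it.j++; continue; }
        it.j = 0;
        it.i = reset_i ? 0 : (uint8_t)(it.i + 1);
    }
    return stats;
}

/* Total latency in clocks: N clocks per iteration times iterations */
uint64_t total_latency_cycles(uint32_t clocks_per_iteration)
{
    assert(clocks_per_iteration > 0);
    return (uint64_t)clocks_per_iteration * walk_iterators().iterations;
}
```

With these assumptions the walk visits 4096 iterations and clears the sum 64 times, once per output element. A made-up N of 10 clocks per iteration would give 40960 clocks total; the real N depends on the synthesized pipeline depth.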