Hi folks looking for feedback on PipelineC. Ideas of what to implement next. I will point you to a recent reddit post which ultimately points to GitHub. https://www.reddit.com/r/FPGA/comments/d0x2p5/serial_8x8_dct_in_pipelinec_lower_resource_usage/ Here is the code to get you interested: // This is the unrolled version of the original dct copy-and-pasted algorithm // https://www.geeksforgeeks.org/discrete-cosine-transform-algorithm-program/ // PipelineC iterations of dctTransformUnrolled are used // to unroll the calculation serially in O(n^4) time // Input 'matrix' and start=1 to begin calculation // Input 'matrix' must stay constant until return .done // 'sum' accumulates over iterations/clocks and should be pipelined // So 'sum' must be a volatile global variable // Keep track of when sum is valid and be read+written volatile uint1_t dct_volatiles_valid; // sum will temporarily store the sum of cosine signals volatile float dct_sum; // dct_result will store the discrete cosine transform // Signal that this is the iteration containing the 'done' result typedef struct dct_done_t { float matrix[DCT_M][DCT_N]; uint1_t done; } dct_done_t; volatile dct_done_t dct_result; dct_done_t dctTransformUnrolled(dct_pixel_t matrix[DCT_M][DCT_N], uint1_t start) { // Assume not done yet dct_result.done = 0; // Start validates volatiles if(start) { dct_volatiles_valid = 1; } // Global func to handle getting to BRAM // 1) Lookup constants from BRAM (using iterators) // 2) Increment iterators // Returns next iterators and constants and will increment when requested dct_lookup_increment_t lookup_increment; uint1_t do_increment; // Only increment when volatiles valid do_increment = dct_volatiles_valid; lookup_increment = dct_lookup_increment(do_increment); // Unpack struct for ease of reading calculation code below float const_val; const_val = lookup_increment.lookup.const_val; float cos_val; cos_val = lookup_increment.lookup.cos_val; dct_iter_t i; i = lookup_increment.incrementer.curr_iters.i; dct_iter_t j; j = lookup_increment.incrementer.curr_iters.j; dct_iter_t k; k = lookup_increment.incrementer.curr_iters.k; dct_iter_t l; l = lookup_increment.incrementer.curr_iters.l; uint1_t reset_k; reset_k = lookup_increment.incrementer.increment.reset_k; uint1_t reset_l; reset_l = lookup_increment.incrementer.increment.reset_l; uint1_t done; done = lookup_increment.incrementer.increment.done; // Do math for this volatile iteration only when // can safely read+write volatiles if(dct_volatiles_valid) { // ~~~ The primary calculation ~~~: // 1) Float * cosine constant from lookup table float dct1; dct1 = (float)matrix[k][l] * cos_val; // 2) Increment sum dct_sum = dct_sum + dct1; // 3) constant * Float and assign into the output matrix dct_result.matrix[i][j] = const_val * dct_sum; // Sum accumulates during the k and l loops // So reset when they are rolling over if(reset_k & reset_l) { dct_sum = 0.0; } // Done yet? dct_result.done = done; // Reset volatiles once done if(done) { dct_volatiles_valid = 0; } } return dct_result; } What does this synthesize to? Essentially a state machine where each state uses the same N clocks worth of logic to do work. (the body of dctTransformUnrolled). Consider the 'execution' of the function in time order. The logic consists of: ~17% of time for getting lookup constants & incrementing the iterators (dct_lookup_increment), reading the [k][l] value out of input 'matrix' ~21% of time for the 1) Float * cosine constant from lookup table, a floating point multiplier ~34% of time for the 2) Increment sum addition, a floating point adder ~21% of time for the 3) constant * Float, a floating point multiplier ~5% of time for the 3) assignment into the output matrix at [i][j] That pipeline takes some fixed number of clock cycles N. That means every N clock cycles 'dct_volatiles_valid' will =1 (after being set at the start). The algorithm unrolls as O(n^4) for 4096 total iterations. So the total latency in clock cycles is N * 4096.
PipelineC (again), dct example, looking for help/interest
Started by ●September 7, 2019