
Accelerating Face Detection on Zynq-7020 Using High Level Synthesis

Started by yuning he May 23, 2017
Hello, here is my question:
Purpose: realize face detection on a Zynq-7020 SoC
Platform: ZedBoard with an OV5640 camera
Completed work: capturing video from the camera, writing it into DDR for storage, and reading it back from DDR for display
Question: how to realize a face detection IP whose throughput reaches 30 fps at 320x240 resolution

Here is what I have done:
Based on the Viola-Jones algorithm, I used an HLS (high-level synthesis) tool to generate a hardware IP from a C++ design.
And this is my reference: https://github.com/cornell-zhang/facedetect-fpga

I have simulated and synthesized it into a hardware IP, but its throughput does not reach the goal because the interval and latency are very large (latency: 338 min to 576,593,236 max; interval: 336 min to 142,310,514,002 max).
Looking into the code, I find the latency mainly comes from the following for loops, but I don't know how to reduce it without giving up too much area.

You could help me a lot with these questions:
1. Is there another way to realize face detection on the Zynq-7020?
2. How do I measure the real throughput of my system, and how does it relate to the synthesis results?
3. Is there any way to optimize the following for loops?

Looking forward to your reply. Please feel free to contact me at any time.
Thanks.

----loop1:
imageScalerL1: for ( i = 0 ; i < IMAGE_HEIGHT ; i++ ){ 
    imageScalerL1_1: for (j=0;j < IMAGE_WIDTH ;j++) {
      #pragma HLS pipeline
      if ( j < w2 && i < h2 ) 
        IMG1_data[i][j] =  Data[(i*y_ratio)>>16][(j*x_ratio)>>16];
    }
  }
----loop2:
Pixely: for( y = 0; y < sum_row; y++ ){
    Pixelx: for ( x = 0; x < sum_col; x++ ){
      /* Updates for Integral Image Window Buffer (I) */
      SetIIu: for ( u = 0; u < WINDOW_SIZE; u++){
      #pragma HLS unroll
        SetIIj: for ( v = 0; v < WINDOW_SIZE; v++ ){
        #pragma HLS unroll
          II[u][v] = II[u][v] + ( I[u][v+1] - I[u][0] );
        }
      }
      
      /* Updates for Square Image Window Buffer (SI) */
      SII[0][0] = SII[0][0] + ( SI[0][1] - SI[0][0] );
      SII[0][1] = SII[0][1] + ( SI[0][WINDOW_SIZE] - SI[0][0] );
      SII[1][0] = SII[1][0] + ( SI[WINDOW_SIZE-1][1] - SI[WINDOW_SIZE-1][0] );
      SII[1][1] = SII[1][1] + ( SI[WINDOW_SIZE-1][WINDOW_SIZE] - SI[WINDOW_SIZE-1][0] );
      
      /* Updates for Image Window Buffer (I) and Square Image Window Buffer (SI) */
      SetIj: for( j = 0; j < 2*WINDOW_SIZE-1; j++){
      #pragma HLS unroll
        SetIi: for( i = 0; i < WINDOW_SIZE; i++ ){
        #pragma HLS unroll
          if( i+j != 2*WINDOW_SIZE-1 ){
            I[i][j] = I[i][j+1];
            SI[i][j] = SI[i][j+1];
          }
          else if ( i > 0 ){
            I[i][j] = I[i][j+1] + I[i-1][j+1];
            SI[i][j] = SI[i][j+1] + SI[i-1][j+1];
          }
        }
      }
      // Last column of the I[][] and SI[][] matrix 
      Ilast: for( i = 0; i < WINDOW_SIZE-1; i++ ){
      #pragma HLS unroll
        I[i][2*WINDOW_SIZE-1] = L[i][x];
        SI[i][2*WINDOW_SIZE-1] = L[i][x]*L[i][x];
      }
      I[WINDOW_SIZE-1][2*WINDOW_SIZE-1] = IMG1_data[y][x];
      SI[WINDOW_SIZE-1][2*WINDOW_SIZE-1] = IMG1_data[y][x]*IMG1_data[y][x];

      /** Updates for Image Line Buffer (L) **/
      LineBuf: for( k = 0; k < WINDOW_SIZE-2; k++ ){
      #pragma HLS unroll
        L[k][x] = L[k+1][x];
      }
      L[WINDOW_SIZE-2][x] = IMG1_data[y][x];

      /* Pass the Integral Image Window buffer through Cascaded Classifier. Only pass
       * when the integral image window buffer has flushed out the initial garbage data */             
      if ( element_counter >= ( ( (WINDOW_SIZE-1)*sum_col + WINDOW_SIZE ) + WINDOW_SIZE -1 ) ) {

         /* Sliding Window should not go beyond the boundary */
         if ( x_index < ( sum_col - (WINDOW_SIZE-1) ) && y_index < ( sum_row - (WINDOW_SIZE-1) ) ){
            p.x = x_index;
            p.y = y_index;
            
            result = cascadeClassifier ( p, II, SII );

           if ( result > 0 )
           {
             MyRect r = {myRound(p.x*factor), myRound(p.y*factor), winSize.width, winSize.height};
             AllCandidates_x[*AllCandidates_size]=r.x;
             AllCandidates_y[*AllCandidates_size]=r.y;
             AllCandidates_w[*AllCandidates_size]=r.width;
             AllCandidates_h[*AllCandidates_size]=r.height;
            *AllCandidates_size=*AllCandidates_size+1;
           }
         }// inner if
         if ( x_index < sum_col-1 )
             x_index = x_index + 1;
         else{ 
             x_index = 0;
             y_index = y_index + 1;
         }
      } // outer if
      element_counter +=1;
    } 
  }

yuning he wrote:
> Hello, here is my question:
> [description and questions trimmed; see the original post above]
> ----loop1:
> imageScalerL1: for ( i = 0 ; i < IMAGE_HEIGHT ; i++ ){
>     imageScalerL1_1: for (j=0;j < IMAGE_WIDTH ;j++) {
>       #pragma HLS pipeline
>       if ( j < w2 && i < h2 )
>         IMG1_data[i][j] =  Data[(i*y_ratio)>>16][(j*x_ratio)>>16];
>     }
>   }
> [loop2 trimmed]
Verilog is not my forte, but I think arrays are arrays. In the initial loop you are retrieving the data from Data[(i*y_ratio)>>16][(j*x_ratio)>>16]. The range of one index is 0 to IMAGE_HEIGHT-1 and the other is 0 to IMAGE_WIDTH-1. I can never recall which is the inner index and which is the outer, but the math involved in calculating the address of the data is simpler if the inner index range is a binary power. Is that the case?

If not, you can achieve the simplification by declaring the inner index to have a range which is a binary power, but only using the subrange that you need. The cost is wasted memory, but it will improve performance and size because the address calculation will not require multiplication, but rather shifts, which are done by mapping the index to the right address lines.

-- 

Rick C
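[Editor's note] A minimal sketch of what that padding can look like in the HLS C++ source, assuming the frame buffers are plain 2-D arrays as in the posted loop1 (IMAGE_WIDTH_PAD and the wrapper function scale_image are invented for this illustration):

    // Sketch only: pad the inner (fastest-varying) dimension to a power of two
    // so the row offset should reduce to bit concatenation instead of a multiply.
    #define IMAGE_HEIGHT    240
    #define IMAGE_WIDTH     320
    #define IMAGE_WIDTH_PAD 512   // next power of two >= IMAGE_WIDTH

    static unsigned char Data[IMAGE_HEIGHT][IMAGE_WIDTH_PAD];      // padded source
    static unsigned char IMG1_data[IMAGE_HEIGHT][IMAGE_WIDTH_PAD]; // padded destination

    void scale_image(int w2, int h2, unsigned int x_ratio, unsigned int y_ratio) {
      imageScalerL1: for (int i = 0; i < IMAGE_HEIGHT; i++) {
        imageScalerL1_1: for (int j = 0; j < IMAGE_WIDTH; j++) {
    #pragma HLS pipeline II=1
          if (j < w2 && i < h2)
            // Row offset should now be (row << 9) | col rather than row*320 + col.
            IMG1_data[i][j] = Data[(i * y_ratio) >> 16][(j * x_ratio) >> 16];
        }
      }
    }

The cost rickman mentions is visible here: only columns 0..319 of each 512-wide row carry data, so the scheme trades wasted storage per row for a simpler address path.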
> Looking into the code, I find the latency mainly comes from the following for loops, but I don't know how to reduce it without giving up too much area.

You're not meeting timing, which means you probably need to go look at the schematic of the critical paths. How many levels are they? How are the multipliers being synthesized? Are they using DSP48s or fabric? As a rule, the more abstract a synthesis tool is, the worse the synthesis results will be.

Also, where is "Data[]"? Is that a block RAM? Or is it DRAM? If you're accessing DRAM directly without a cache, you might have problems.
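[Editor's note] A minimal sketch of how those placement questions can be made explicit in the source, assuming Vivado HLS pragma syntax (the function name, the WINDOW_SIZE value, and the idea of keeping the whole frame on-chip are placeholders for illustration, not the reference project's actual interface; core names should be checked against the tool's documentation):

    // Sketch only: explicit memory binding / partitioning directives.
    #define WINDOW_SIZE 25   // placeholder; use the value from the actual design

    void detect_top(void) {
      // Frame buffer kept on-chip; RESOURCE pins it to a dual-port block RAM.
      static unsigned char IMG1_data[240][320];
    #pragma HLS RESOURCE variable=IMG1_data core=RAM_2P_BRAM
      // Window buffers: fully partitioning them turns them into registers, so the
      // unrolled SetII/SetI loops can touch every element in the same clock cycle.
      static int II[WINDOW_SIZE][WINDOW_SIZE];
      static int I [WINDOW_SIZE][2 * WINDOW_SIZE];
    #pragma HLS ARRAY_PARTITION variable=II complete dim=0
    #pragma HLS ARRAY_PARTITION variable=I  complete dim=0

      (void)IMG1_data; (void)II; (void)I;   // detection body omitted; see loop2 above
    }

After C synthesis the tool's utilization report breaks BRAM_18K and DSP48E usage down by instance, which is usually the quickest way to answer Kevin's questions about where the multipliers and Data[] actually ended up.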
On 05/23/2017 11:21 AM, Kevin Neilson wrote:
> As a rule, the more abstract a synthesis tool is, the worse the synthesis results will be.

Sorry, I just walked back from lunch through a small horde of web developers, and was trying to envision the looks on their faces as C was referred to as being much too high level.

-- 
Rob Gaddi, Highland Technology -- www.highlandtechnology.com
Email address domain is currently out of order. See above to fix.
Thank you for your reply.

Timing is met when I slow the clock down to 100 MHz.
The multipliers are automatically synthesized onto DSP48s.
And "Data[]" is stored directly in block RAM.
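[Editor's note] To connect those numbers to question 2, here is a rough cycle budget at the stated 100 MHz clock, as a sketch only. It assumes one invocation of the top function processes one full frame; for reference, the reported worst-case interval of ~1.4e11 cycles would be roughly 1400 seconds per frame at this clock, though that maximum is often pessimistic when loop bounds are data-dependent.

    // Back-of-the-envelope budget at 100 MHz (sketch only; it ignores the
    // multi-scale image pyramid, which multiplies the pixel count).
    #include <cstdio>

    int main() {
        const double fclk = 100e6;                    // 100 MHz
        const int width = 320, height = 240, fps = 30;

        const double pixels_per_sec = double(width) * height * fps;  // 2,304,000
        const double cycles_per_pix = fclk / pixels_per_sec;         // ~43.4

        // The HLS "interval" is in clock cycles per invocation; if one invocation
        // processes one frame, 30 fps needs interval <= fclk / 30 ~= 3.33e6 cycles.
        const double max_interval_cycles = fclk / fps;

        std::printf("budget: %.1f cycles/pixel, interval <= %.0f cycles/frame\n",
                    cycles_per_pix, max_interval_cycles);
        return 0;
    }

Comparing the synthesized interval against fclk/30 (about 3.3 million cycles) gives the relation between the synthesis report and real throughput, at least to first order.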

On Tuesday, May 23, 2017 at 3:11:40 PM UTC+8, rickman wrote:
> yuning he wrote:
> > [original post and code trimmed; see above]
>
> ... the math involved in calculating the address of the data is simpler if
> the inner index range is a binary power. ... The cost is wasted memory, but
> it will improve performance and size because the address calculation will
> not require multiplication, but rather shifts, which are done by mapping the
> index to the right address lines.
>
> Rick C
Thank you for your reply.
Here IMAGE_HEIGHT equals 240 and IMAGE_WIDTH equals 320. According to your advice, I can change the inner index of the array to a binary power to speed up the address calculation. Is that right?
yuning he wrote:
> [rickman's earlier reply and the quoted original post trimmed]
>
> Thank you for your reply.
> Here IMAGE_HEIGHT equals 240 and IMAGE_WIDTH equals 320. According to your
> advice, I can change the inner index of the array to a binary power to speed
> up the address calculation. Is that right?
I believe so. To minimize the waste of memory, I would make the 240 the inner index with a range of 256. Then the multiplication becomes a matter of shifting the outer index by 8 bits and adding the inner index. I don't know for sure, but the tools should figure this out automatically. Keep your loop range as 0 to 239 and everything will still work as you expect, skipping over 16 array values at each increment of the outer index. You will need to be consistent in all accesses to the memory.

-- 

Rick C
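[Editor's note] A small sketch of that layout, assuming the image is stored transposed so the 240-pixel extent becomes the inner index (all names here are invented; the posted code would need its Data[row][col] accesses swapped to Data[col][row] to stay consistent):

    // Sketch only: 240-wide inner extent padded to 256 so the address is a
    // shift-and-add.  Only inner indices 0..239 are used; 16 slots per outer
    // index are wasted, as described above.
    #define OUTER     320   // image width becomes the outer index
    #define INNER     240   // image height becomes the inner index
    #define INNER_PAD 256   // power of two >= 240

    static unsigned char img[OUTER][INNER_PAD];

    // Flat view of the same addressing: outer*256 + inner, i.e. a shift by
    // 8 bits plus an add, with no hardware multiplier involved.
    static inline unsigned flat_addr(unsigned outer, unsigned inner) {
        return (outer << 8) + inner;
    }

Whether this helps in practice depends on how the HLS tool already implements the 2-D index (a constant multiply by 320 is typically built from shifts and adds anyway), so it is worth comparing the resource report before and after the change.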