VGA Output in 7 Slices. Really.

Victor Yurkovsky, September 25, 2012

Ridiculous? Read on - I will show you how to generate VGA timing in seven Xilinx® Spartan3® slices.

Some time ago I needed to output video to a VGA monitor for my Apple ][ FPGA clone.  Obviously (I thought), VGA's been done before and all I had to do was find some Verilog code and drop it into my design.  As is often the case (with me anyway), the task proved to be very different from my imagined 'couple of hours to integrate the IP'.

I found some example code for my board.  I managed to get it working, but it consumed a noticeable chunk of my XC3S1000!  Most of you would say "Who cares - you still have plenty of room on the FPGA for your stupid Apple ][!"  But as you know from my last article, FPGA Assemblers and Time Machines, I want to really get to the bottom of things, and I've built tools to do so.  I started looking closer at the design.  I started searching for tutorials and explanations and other projects.  Invariably the utilization seemed high.  Is it really that difficult to generate VGA timing?

Looking at available prior art showed pretty much the same design concepts.  Derive a pixel clock.  Set up a big counter.  Open up the Horizontal Sync pulse for a certain number of clocks (a comparator).  Open up Horizontal blanking around Horizontal Sync (more comparators).  Count horizontal lines (another counter).  Do vertical sync the same way (a bunch more comparators).  Add it all up, you have a lot of circuitry.

Now I was really intrigued.  I needed the following signals:

  • Horizontal blanking (visible line starts right after)
  • Horizontal sync pulse
  • Vertical blanking (first line starts right after)
  • Vertical sync

Looking at one version of VGA documentation (there are many variations) I found the following horizontal line timing:

              |____|                  |____|

       bporch hsync fporch, XXX=pixels

    component     uS   
    hsync        4.00     
    bporch       1.79
    pixels      25.42
    fporch       0.79

My clock runs at 100MHz; using it my counts would be 400, 179, 2542, 79 - requiring very wide counters.  I could make my counters smaller by creating a slower pixel clock, but that's how everyone does it so where is the fun in that?

After a few days of thinking in the bathtub, I noticed something curious.  If I use a timebase of 99 clocks at 100MHz, I can drop my counters to a very reasonable width - 5 bits - and still output something that very closely resembles VGA timing.  Why 99?  More on that later.  For now, observe:

               99-clk    My
    component  cycles   timing   VGA spec   
    hsync         4      3.96       4.00     
    bporch        2      1.98       1.79
    pixels       25     24.75      25.42
    fporch        1      0.99       0.79
                ===     =====      =====
                 32     31.68      31.77

  |____|                  |____|
 1  4   2     25         1         31.68us scanline

Now my horizontal line is split into 32 units (99 clocks each), and each scanline pops out every 31.68uS, a little fast but close enough (monitors are pretty amazing at syncing to just about anything).  A 5-bit counter is pretty small, and 5-input comparators aren't too bad.  (And we have to generate a divide-by-99 circuit of some kind...  But let's leave that until later).  Is there a way to improve on that?
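The arithmetic behind the two tables can be checked with a few lines of Python.  This is only a sketch of the bookkeeping: the spec values and unit counts are copied from the tables above, and the names (COMPONENTS, raw_count, unit_time_us) are made up for illustration.

```python
CLOCK_MHZ = 100        # system clock from the article
TIMEBASE_CLOCKS = 99   # clocks per timing unit

# (name, spec in microseconds, units of 99 clocks), copied from the tables above
COMPONENTS = [
    ("hsync", 4.00, 4),
    ("bporch", 1.79, 2),
    ("pixels", 25.42, 25),
    ("fporch", 0.79, 1),
]

def raw_count(spec_us):
    """Clocks needed to time spec_us directly from the 100 MHz clock."""
    return round(spec_us * CLOCK_MHZ)

def unit_time_us(units):
    """Duration of `units` timebase ticks in microseconds."""
    return units * TIMEBASE_CLOCKS / CLOCK_MHZ

raw_counts = [raw_count(spec) for _, spec, _ in COMPONENTS]
total_units = sum(units for _, _, units in COMPONENTS)
line_us = unit_time_us(total_units)

for (name, spec, units), raw in zip(COMPONENTS, raw_counts):
    print(f"{name:7s} raw={raw:5d} clocks  units={units:2d}  "
          f"time={unit_time_us(units):6.2f} us  (spec {spec:5.2f} us)")
print(f"widest raw counter: {max(raw_counts).bit_length()} bits")
print(f"unit counter:       {(total_units - 1).bit_length()} bits")
print(f"scanline:           {total_units} units = {line_us:.2f} us")
```

Counting 99-clock units instead of raw clocks is what shrinks the widest count from 12 bits to 5.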

SRL16 to the Rescue

Normally you can expect to fit 2 bits into every slice.  A 32-bit shift register would take 16 slices.  But Xilinx designers gave us an incredible gift - the ability to configure each LUT as a 1-to-16-bit shift register!  Two of them - a single slice - can be ganged up to become a 32-bit shift register.  32 bits, 32 units in our cycle... Yes, we can use a 32-bit shift register to generate Horizontal Sync and Horizontal Blanking pulses, 1 slice each!  We can loop the shift registers in on themselves, and program them to generate pulses.

All we have to do for 4-unit horizontal sync is to configure the SRL32 like this:


And blanking, if you need it:
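As a rough Python sketch of both looped registers: each one is modelled as a 32-entry ring whose output bit is fed back into the first stage.  The active-low polarity, the placement of the pulses within the line (front porch first, then sync, then back porch) and all names are assumptions for illustration, not the original design's INIT values.

```python
UNITS = 32  # one scanline is 32 timebase units

def timeline(low_start, low_len, units=UNITS):
    """Desired output over one line: 1 everywhere except an active-low pulse."""
    return [0 if low_start <= t < low_start + low_len else 1
            for t in range(units)]

def run_ring(seq, ticks):
    """Load seq into a looped shift register and return the output per tick."""
    ring = list(reversed(seq))      # last stage holds the first bit to come out
    out = []
    for _ in range(ticks):
        bit = ring[-1]              # bit leaving the last stage
        out.append(bit)
        ring = [bit] + ring[:-1]    # loop it back into the first stage
    return out

HSYNC = timeline(1, 4)    # assumed: 1 unit of front porch, then 4 units of sync
HBLANK = timeline(0, 7)   # assumed: fporch + hsync + bporch = 1 + 4 + 2 units

hsync_out = run_ring(HSYNC, 2 * UNITS)
hblank_out = run_ring(HBLANK, 2 * UNITS)

print("hsync :", "".join(map(str, hsync_out[:UNITS])))
print("hblank:", "".join(map(str, hblank_out[:UNITS])))
```

Because the register is looped on itself, the pattern repeats every 32 units with no counter or comparator involved.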



Now, let's get back to the divide-by-99 problem.  A brute-force solution requires a 7-bit resettable counter and a 7-bit match circuit to detect 99 and generate a reset.  We could create a 99-bit-long shift register with SRLs, but that would still take up 7 LUTs.  Not bad, but we can do better.  Back to the bathtub.

I noticed that there are two independent shift registers in a slice.  Each register can be any length between 1 and 17 (the flip-flop can be attached to the 16-bit SRL16 creating the 17th element).  The two outputs can also be ANDed together using the carry chain in a clever way.  Now what if we were to load each register with a single on bit and let them roll?

If the registers are the same size, obviously, nothing interesting will happen - the two will mirror each other, and if the pulses coincide, the ANDed output will reflect them.  But what if we were to give them different widths?  Let's name the widths of the registers A and B.  If A and B are mutually prime, the resulting phasing action will create a pulse generator that has a period of A*B and a pulse width of one clock.

Imagine two gears connected to each other.  Gear A has 4 teeth, while gear B has 3 (yes, I know, that's why we are using our imagination).  Let's mark the spot where they touch, on both gears.  Now rotate gear A a full revolution.  4 tooth-clicks later, gear B is one tooth ahead - no match.  Another A revolution.  8 clicks later, gear B's mark is now 2 teeth ahead and approaching the other side.  Another one - 12 clicks later gear B has turned an extra revolution and matches our mark.  Bingo - 12 clicks, 4 * 3.

You may have noticed that 99 is 11 * 9.  We can make a 1-slice circuit with 2 SRL16s, 11 and 9 bits wide, with outputs ANDed in the carry chain.  Every 99 cycles, a pulse will come out of our slice.  Perfection.
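The phasing trick is easy to try out in software.  The following Python sketch models each register as a one-hot ring and ANDs the two outputs; the function name and the list-based model are illustrative assumptions, and the carry-chain wiring itself is not modelled.

```python
def phased_pulses(a, b, ticks):
    """Simulate two looped one-hot shift registers of lengths a and b.

    Returns the tick numbers at which both outputs are 1 at the same time,
    i.e. when the ANDed output pulses.
    """
    ring_a = [1] + [0] * (a - 1)
    ring_b = [1] + [0] * (b - 1)
    hits = []
    for t in range(ticks):
        if ring_a[-1] & ring_b[-1]:          # AND of the two outputs
            hits.append(t)
        ring_a = [ring_a[-1]] + ring_a[:-1]  # rotate each ring by one stage
        ring_b = [ring_b[-1]] + ring_b[:-1]
    return hits

gears = phased_pulses(4, 3, 36)      # the 4-tooth and 3-tooth gears
div99 = phased_pulses(11, 9, 198)    # the divide-by-99 slice
shared = phased_pulses(4, 6, 24)     # lengths that share a factor of 2

print("4 x 3 :", gears, "period", gears[1] - gears[0])
print("11 x 9:", div99, "period", div99[1] - div99[0])
print("4 x 6 :", shared, "period", shared[1] - shared[0])
```

The last case shows why the widths must be mutually prime: with 4 and 6 the pulse comes back after 12 clocks, not 24.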

Now you see why I chose 99.  100, for instance, is impossible to achieve using this method - its only mutually prime factor pairs are 1 * 100 and 4 * 25, and neither fits into two registers of at most 17 bits each.
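A short Python search makes it easy to test other candidate periods.  It is a sketch that assumes the 17-stage limit mentioned earlier (an SRL16 plus one flip-flop); the function name coprime_pairs is invented for this example.

```python
from math import gcd

MAX_LEN = 17  # SRL16 plus one flip-flop, per the article

def coprime_pairs(n, max_len=MAX_LEN):
    """All (a, b) with a <= b, a * b == n, gcd(a, b) == 1 and b <= max_len."""
    return [(a, n // a)
            for a in range(1, max_len + 1)
            if n % a == 0
            and a <= n // a <= max_len
            and gcd(a, n // a) == 1]

print("99 :", coprime_pairs(99))
print("100:", coprime_pairs(100))
print("100 with 25-stage registers:", coprime_pairs(100, max_len=25))
```

An empty list means the period cannot be built from a single slice this way.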

Inventory so far:

    Divide-by-99           1 slice
    Horizontal sync        1 slice
    Horizontal blanking    1 slice

Squeezing the Sync SRLs even more

Well, it turns out we can do even better!  Astute readers will notice that our SRL is 32 bits long, or two 16-bit SRLs ganged together.  Xilinx designers provided a very convenient MC15 output of the 16th element of each SRL that can be connected with fast logic to the next SRL for ganging them together, just like we do.

We don't use the shift register taps.  Why not use them?  Let's take the horizontal sync shift register, and tap 1 cycle before the output, and using the other SRL16, tap 2 cycles after the output.  Now we have a tap that goes on just before the sync, and another that goes off 2 cycles later.  (Mis)using the carry logic, we can OR these together (they happen to partially overlap).  Now we have a single slice outputting both horizontal sync and horizontal blanking.
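The tap arithmetic can be sanity-checked with a small Python model.  It is a sketch only: pulses are treated as active-high (so that OR widens them), the sync pulse is assumed to sit one unit into the line, and the taps are modelled as simple time shifts of the looped pattern instead of actual SRL16 address inputs.

```python
UNITS = 32
SYNC_START, SYNC_LEN = 1, 4   # assumed placement: 1 unit of front porch, then sync

def sync_at(t):
    """Active-high sync as seen at the register output at time t."""
    return 1 if SYNC_START <= t % UNITS < SYNC_START + SYNC_LEN else 0

def blank_at(t):
    """OR of two taps on the same looped register."""
    early = sync_at(t + 1)   # tap one stage before the output: pulse 1 unit sooner
    late = sync_at(t - 2)    # tap two stages after the output: pulse 2 units later
    return early | late

sync = [sync_at(t) for t in range(UNITS)]
blank = [blank_at(t) for t in range(UNITS)]

print("sync :", "".join(map(str, sync)))
print("blank:", "".join(map(str, blank)))
print("blank width:", sum(blank), "units")
```

The two 4-unit tap pulses overlap by one unit, so their OR is a single 7-unit blanking pulse covering the front porch, the sync pulse and the back porch.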

Inventory so far:

    Divide-by-99              1 slice
    Horizontal sync/blanking  1 slice

Vertical timing

Expressing vertical timing in horizontal lines, the timing looks like this:

    |___|                  |____|
      2  32     480      14     

Well, we could divide it by 2... But it's still many bits to count.  Can we do better?  Let's look at the timing:

    component  VGA spec (uS)
    bporch       1020
    pixels      15250
    fporch        450
                =====
    rate      59.58Hz

The vertical sync pulse has to be 2 lines, so let's leave it out for a moment.

Looks bad.  But we don't have to be exact, as the monitor will take up some slack.  So can we use a 32-bit shifter to generate this timing?  Are there magical units such that 32 of them make up the whole vertical period?  A little more bathtub thinking yields the following:

      0.99 uS * 16 =    15.84 uS        //a hypothetical clock unit
     15.84 uS * 33 =   522.72 uS        //an SRL33 fits into a slice
    522.72 uS * 32 = 16727.04 uS        //really close to 16,784 uS!!!

In other words, if we take our .99uS timebase, multiply it by 16, and phase it with our .99uS timebase multiplied by 33, we will have a 522.72uS timebase that is perfectly 1/32 of the vertical timing period.  Then we can use it to generate the vertical blanking like this:

    component  cycles   time (uS)   VGA spec (uS)
    bporch        2      1045.44        1020
    data         29     15158.88       15250
    fporch        1       522.72         450
                 ===    ========       =====
                 32     16727.04       16784
    rate                 59.78Hz     59.58Hz

That is really close.
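The numbers in this table follow directly from the 0.99uS unit.  A small Python check, with variable names chosen only for this sketch:

```python
from math import gcd

UNIT_US = 0.99          # one 99-clock unit at 100 MHz
A, B = 16, 33           # lengths of the two phased registers
SLOTS = 32              # positions in the vertical SRL32
SPEC_FRAME_US = 16784   # VGA spec frame period from the table above

coprime = gcd(A, B) == 1            # phasing only gives A * B if this holds
slot_units = A * B                  # units per vertical slot
slot_us = slot_units * UNIT_US      # duration of one vertical slot
frame_us = slot_us * SLOTS          # full vertical period
rate_hz = 1e6 / frame_us
spec_rate_hz = 1e6 / SPEC_FRAME_US
lines_per_frame = slot_units * SLOTS // 32   # 32 units per scanline

print(f"slot : {slot_units} units = {slot_us:.2f} us")
print(f"frame: {frame_us:.2f} us, {rate_hz:.2f} Hz (spec {spec_rate_hz:.2f} Hz)")
print(f"lines per frame: {lines_per_frame}")
print(f"error vs spec: {100 * (frame_us - SPEC_FRAME_US) / SPEC_FRAME_US:.2f} %")
```

The frame comes out at 528 scanlines, the same total as the 2 + 32 + 480 + 14 lines in the vertical timing diagram.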

This time, we will need 1 1/2 slices - 1/2 a slice for the 16 bits, and an entire slice for the 33 bits (we will use a flip-flop to extend the 32-bit shift register). 

Now, another slice will implement an SRL32 with the 3-bit-long VBLANK pulse.

We will use up another slice to generate a 2-line delay to shape the VSYNC pulse tapped from just after the porch.

Inventory so far:

    Divide-by-99              1     slice
    Horizontal sync/blanking  1     slice
    Divide-by-528             1 1/2 slice
    Vertical blanking         1     slice
    Vertical sync             1     slice
                              =====
                              5 1/2 slices

Hold it!  Didn't I say 7 slices?  Why didn't I say 5 1/2 slices? I just didn't think you would believe me.

Xilinx and Spartan3 are registered trademarks of Xilinx Inc.

P.S. Due to popular demand, I will include some fpgasm and Verilog source in my next post.

Copyright 2012 Victor Yurkovsky

Comment by DiegoR, January 11, 2015
That's impressive and really clever! Can't wait to try it out.
