FPGARelated.com
Forums

Async Processors

Started by Jim Granville February 8, 2006
  Further to an earlier thread on ASYNC design, and cores, this in the 
news :
http://www.eet.com/news/design/showArticle.jhtml?articleID=179101800

  and a little more is here
http://www.handshakesolutions.com/Products_Services/ARM996HS/Index.html

  with some plots here:
http://www.handshakesolutions.com/assets/downloadablefile/ARM996HS_leaflet_feb06-13004.pdf

  They don't mention the Vcc of the compared 968E-S, but the Joules and 
EMC look compelling, as does the 'self tracking' aspect of Async.

  They also have an Async 8051
http://www.handshakesolutions.com/Products_Services/HT-80C51/Index.html

  -jg

Hi,
While always dealing with clock cycles, I am really surprised to learn
how a clockless CPU works. Amazing!

Weng

Jim Granville wrote:
> Further to an earlier thread on ASYNC design, and cores, this in the
> news :
> http://www.eet.com/news/design/showArticle.jhtml?articleID=179101800
>
> and a little more is here
> http://www.handshakesolutions.com/Products_Services/ARM996HS/Index.html
>
> with some plots here:
> http://www.handshakesolutions.com/assets/downloadablefile/ARM996HS_leaflet_feb06-13004.pdf
>
> They don't mention the Vcc of the compared 968E-S, but the Joules and
> EMC look compelling, as does the 'self tracking' aspect of Async.
>
> They also have an Async 8051
> http://www.handshakesolutions.com/Products_Services/HT-80C51/Index.html

I seem to recall participating in a discussion of async processors a while back and came to the conclusion that they had few advantages in the real world. The claim of improved speed is a red herring. The clock cycle of a processor is fixed by the longest delay path, evaluated at lowest voltage and highest temperature. The same is true of the async processor, but the place where you have to deal with the variability is at the system level, not the clock cycle. So at room temperature you may find that the processor runs faster, but under worst case conditions you still have to get X amount of computations done in Y amount of time. The two processors will likely be the same speed or the async processor may even be slower. With a clocked processor, you can calculate exactly how fast each path will be and margin is added to the clock cycle to deal with worst case wafer processing. The async processor has a data path and a handshake path with the handshake being designed for a longer delay. This delay delta also has to have margin and likely more than the clocked processor to account for two paths. This may make the async processor slower in the worst case conditions.

Since your system timing must work under all cases, you can't really use the extra computations that are available when not running under worst case conditions, unless you can do SETI calculations or something that is not required to get done.

I can't say for sure that the async processor does not use less power than a clocked processor, but I can't see why that would be true. Both are clocked. The async processor is clocked locally and dedicates lots of logic to generating and propagating the clock. A clocked chip just has to distribute the clock. The rest of the logic is the same between the two.

I suppose that the async processor does have an advantage in the area of noise. As SOC designs add more and more analog and even RF onto the same die, this will become more important. But if EMI with the outside world is the consideration, there are techniques to spread the spectrum of the clock that reduce the generated EMI. This won't help on-chip because each clock edge generates large transients which upset analog signals.

I can't comment on the data provided by the manufacturer. I expect that you can achieve similar results with very aggressive clock management. I don't recall the name of the company, but I remember recently reading about one that has cut CPU power significantly that way. I think they were building a processor to power a desktop computer and got Pentium 4 processing speeds at just 25 Watts compared to 80+ Watts for the Pentium 4. That may not convey well to the embedded world where there is less parallelism. So I am not a convert to async processing as yet.

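To put rough numbers on the margin argument above, here is a minimal sketch. It assumes a toy model in which every delay scales by a single process/voltage/temperature (PVT) factor. The delay values, margins, and function names are hypothetical and chosen only to illustrate the reasoning; none of them come from a datasheet.

```python
# Toy model: all numbers and names below are illustrative assumptions.

def clocked_period(nominal_path_ns, worst_pvt, margin):
    """Clock period is fixed once, at the worst-case corner plus margin."""
    return nominal_path_ns * worst_pvt * (1.0 + margin)

def async_cycle(nominal_path_ns, pvt, handshake_margin):
    """Handshake delay is built as a ratio of the data-path delay,
    so it scales with the same PVT factor as the logic it covers."""
    return nominal_path_ns * pvt * (1.0 + handshake_margin)

NOMINAL_NS = 10.0        # hypothetical critical-path delay at the typical corner
WORST_PVT = 1.5          # hypothetical slow-corner scaling factor
CLOCK_MARGIN = 0.10      # hypothetical margin on the clock period
HANDSHAKE_MARGIN = 0.20  # hypothetical (larger) margin on the handshake path

period = clocked_period(NOMINAL_NS, WORST_PVT, CLOCK_MARGIN)
for label, pvt in [("best", 0.7), ("typical", 1.0), ("worst", WORST_PVT)]:
    cycle = async_cycle(NOMINAL_NS, pvt, HANDSHAKE_MARGIN)
    print(f"{label:8s} clocked {period:5.2f} ns   async {cycle:5.2f} ns")
```

With these made-up figures the async cycle is longer than the clock period at the worst corner and shorter at the other corners, which is the trade-off under debate in this thread.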
rickman wrote:

> Jim Granville wrote:
>
>>Further to an earlier thread on ASYNC design, and cores, this in the
>>news :
>>http://www.eet.com/news/design/showArticle.jhtml?articleID=179101800
>>
>> and a little more is here
>>http://www.handshakesolutions.com/Products_Services/ARM996HS/Index.html
>>
>> with some plots here:
>>http://www.handshakesolutions.com/assets/downloadablefile/ARM996HS_leaflet_feb06-13004.pdf
>>
>> They don't mention the Vcc of the compared 968E-S, but the Joules and
>>EMC look compelling, as does the 'self tracking' aspect of Async.
>>
>> They also have an Async 8051
>>http://www.handshakesolutions.com/Products_Services/HT-80C51/Index.html
>
>
> I seem to recall participating in a discussion of asynch processors a
> while back and came to the conclusion that they had few advantages in
> the real world. The claim of improved speed is a red herring. The
> clock cycle of a processor is fixed by the longest delay path which is
> at lowest voltage and highest temperature. The same is true of the
> async processor, but the place where you have to deal with the
> variability is at the system level, not the clock cycle. So at room
> temperature you may find that the processor runs faster, but under
> worst case conditions you still have to get X amount of computations
> done in Y amount of time.

Yes, but systems commonly spend a LOT of time waiting on external, or time, events.

> The two processors will likely be the same
> speed or the async processor may even be slower. With a clocked
> processor, you can calculate exactly how fast each path will be and
> margin is added to the clock cycle to deal with worst case wafer
> processing. The async processor has a data path and a handshake path
> with the handshake being designed for a longer delay. This delay delta
> also has to have margin and likely more than the clocked processor to
> account for two paths.

Why? In the clocked case, you have to spec to cover Process spreads, and also Vcc and Temp. That's three spreads. The Async design self tracks all three, and the margin is there by ratio.

> This may make the async processor slower in the
> worst case conditions.
>
> Since your system timing must work under all cases, you can't really
> use the extra computations that are available when not running under
> worst case conditions, unless you can do SETI calculations or something
> that is not required to get done.
>
> I can't say for sure that the async processor does not use less power
> than a clocked processor, but I can't see why that would be true.

You did look at their Joule plots ?

> Both
> are clocked. The async processor is clocked locally and dedicates lots
> of logic to generating and propagating the clock.

Their gate count comparisons suggest this cost is not as great as one would first think.
> A clocked chip just has to distribute the clock.
... and that involves massive clock trees, and amps of clock driver spikes, in some devices....(not to mention electro migration issues...)

> The rest of the logic is the same between
> the two.
>
> I suppose that the async processor does have an advantage in the area
> of noise.

yes. [and probably makes some code-cracking much harder...]

> As SOC designs add more and more analog and even RF onto the
> same die, this will become more important. But if EMI with the outside
> world is the consideration, there are techniques to spread the spectrum
> of the clock that reduce the generated EMI. This won't help on-chip
> because each clock edge generates large transients which upset analog
> signals.
>
> I can't comment on the data provided by the manufacturer. I expect
> that you can achieve similar results with very agressive clock
> management.

Perhaps, in the limiting case, yes - but you have two problems:
a) That is a LOT of NEW system overhead, to manage all that aggressive clock management...
b) The Async core does this 'clock management' for free - it is part of the design.

> I don't recall the name of the company, but I remember
> recently reading about one that has cut CPU power significantly that
> way. I think they were building a processor to power a desktop
> computer and got Pentium 4 processing speeds at just 25 Watts compared
> to 80+ Watts for the Pentium 4.

Intel are now talking of Multiple/Split Vccs on a die, including some mention of magnetic layers, and inductors, but that is horizon stuff, not their current volume die. I am sure they have an impressive road map, as that is one thing that swung Apple... :)

> That may not convey well to the
> embedded world where there is less paralellism. So I am not a convert
> to async processing as yet.

I'd like to see a more complete data sheet, and some real silicon, but the EMC plot of the HT80C51 running identical code is certainly an eye opener (if it is a true comparison).

It is nice to see (pico) Joules / Opcode quoted, and those are the right units to be thinking in.

  -jg

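As a small illustration of why energy per opcode is a convenient unit, the sketch below converts an average power and an instruction rate into picojoules per opcode. The function name and the input figures are hypothetical and are not taken from the Handshake Solutions leaflet.

```python
def picojoules_per_opcode(avg_power_mw, opcodes_per_second):
    """Energy per opcode in pJ: (power in W / opcodes per s) * 1e12."""
    watts = avg_power_mw * 1e-3
    return watts / opcodes_per_second * 1e12

# Hypothetical figures: 5 mW average while retiring one million opcodes/s.
print(picojoules_per_opcode(avg_power_mw=5.0, opcodes_per_second=1e6))
```

Because the result is energy per unit of work, it allows two cores to be compared even when they do not run at the same speed.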
rickman wrote:
> The two processors will likely be the same
> speed or the async processor may even be slower. With a clocked
> processor, you can calculate exactly how fast each path will be and
> margin is added to the clock cycle to deal with worst case wafer
> processing. The async processor has a data path and a handshake path
> with the handshake being designed for a longer delay. This delay delta
> also has to have margin and likely more than the clocked processor to
> account for two paths. This may make the async processor slower in the
> worst case conditions.

There are a lot of different async technologies, and not all suffer from this. Dual-rail designs with an active ack do not rely on the handshake having a longer delay to envelop the data path's worst case. Phased Logic designs are one example.

> Since your system timing must work under all cases, you can't really
> use the extra computations that are available when not running under
> worst case conditions, unless you can do SETI calculations or something
> that is not required to get done.

Using dual rail with ack, there is no worst case design consideration internal to the logic ... it's just functionally correct by design at any speed. So, if the chip is running fast, so does the logic, up until it must synchronize with the outside world.
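
A minimal sketch of the dual-rail idea may help here. It models a generic four-phase dual-rail code (two wires per bit, with an all-low spacer between data words) and a completion detector that only reports "done" once every bit has actually arrived. This is a textbook-style illustration under stated assumptions, not the circuit used by any particular vendor or by Phased Logic; all function names are hypothetical.

```python
NULL = (0, 0)  # spacer: no data present on this bit

def encode_bit(value):
    """Dual-rail code: (true_rail, false_rail)."""
    return (1, 0) if value else (0, 1)

def is_valid(pair):
    """A bit carries data when exactly one of its two rails is high."""
    return pair in ((1, 0), (0, 1))

def completion(word):
    """Completion detector: True once every bit has arrived,
    False once every bit has returned to NULL, None while in between."""
    if all(is_valid(p) for p in word):
        return True
    if all(p == NULL for p in word):
        return False
    return None

def decode(word):
    """Recover the integer once completion() reports True (LSB first)."""
    return sum(t << i for i, (t, _f) in enumerate(word))

# Bits of the value 0b1010 arrive in an arbitrary order, at arbitrary times.
word = [NULL] * 4
target = [encode_bit((0b1010 >> i) & 1) for i in range(4)]
for i in (2, 0, 3, 1):
    word[i] = target[i]
    print("after bit", i, "-> completion:", completion(word))
print("decoded:", decode(word))
```

The ack is derived from the data wires themselves, so no matched delay line is needed; the cost is the second wire per bit and the completion logic.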

> I can't say for sure that the async processor does not use less power
> than a clocked processor, but I can't see why that would be true. Both
> are clocked. The async processor is clocked locally and dedicates lots
> of logic to generating and propagating the clock. A clocked chip just
> has to distribute the clock. The rest of the logic is the same between
> the two.

For fine-grained async, there is very little cascaded logic, and as such very little transitional glitching in comparison to relatively deep combinatorials that are clocked. This transitional glitching at clocks consumes more power than the best-case behavior of clean transitions of all signals at clock edges with no prop or routing delays.

For coarse-grained async, the advantage obviously goes away.

> I suppose that the async processor does have an advantage in the area
> of noise. As SOC designs add more and more analog and even RF onto the
> same die, this will become more important. But if EMI with the outside
> world is the consideration, there are techniques to spread the spectrum
> of the clock that reduce the generated EMI. This won't help on-chip
> because each clock edge generates large transients which upset analog
> signals.

By design, clocked logic creates a distribution of additive current spikes following clock edges, even if spread spectrum. This simply is less of a problem, if any, with async designs. Async has a much better chance of creating a larger DC component in the power demand by time-spreading transitions so that the on-chip capacitance can filter the smaller transition spikes, instead of the high AC components with a lot of frequency components that you get with clocked designs.

In the whole discussion about the current at the center of the ball array and DC currents, this was the point that was missed. If you slow the clock down enough, the current will go from zero, to a peak shortly after a clock, and back to zero, with any clocked design. To get the current profile to maintain a significant DC level for dynamic currents requires carefully balancing multiple clock domains and using deeper than one level of LUTs with long routing to time-spread the clock currents. Very regular designs, with short routing and a single LUT depth, will generate a dynamic current spike 1-3 LUT delays from the clock transition. On small chips which do not have a huge clock net skew, this will mean most of the dynamic current will occur in a two or three LUT delay window following clock transitions. Larger designs with a high distribution of multiple levels of logic and routing delays flatten this distribution out.

Dual rail with ack designs just completely avoid this problem.

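The current-profile point can be sketched with a very crude model: count gate transitions per time slot and compare the peak to the average. The slot count, the transition count, and the perfectly even spreading for the self-timed case are all simplifying assumptions made only for this illustration.

```python
def peak_to_average(profile):
    """Ratio of the largest per-slot switching count to the mean."""
    return max(profile) / (sum(profile) / len(profile))

SLOTS = 20         # hypothetical: time slots (LUT delays) per cycle
TRANSITIONS = 600  # hypothetical: gate transitions per cycle

# Clocked, shallow logic: everything switches within 3 slots of the edge.
clocked = [TRANSITIONS // 3] * 3 + [0] * (SLOTS - 3)

# Self-timed (idealised): the same transitions spread evenly over the cycle.
spread = [TRANSITIONS // SLOTS] * SLOTS

print("clocked peak/average:", peak_to_average(clocked))
print("spread  peak/average:", peak_to_average(spread))
```

Both profiles switch the same total charge per cycle; only the shape differs, and it is the shape that determines how much the on-chip capacitance can filter.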
Jim Granville wrote:
> rickman wrote:
>
> > Jim Granville wrote:
> >
> >>Further to an earlier thread on ASYNC design, and cores, this in the
> >>news :
> >>http://www.eet.com/news/design/showArticle.jhtml?articleID=179101800
> >>
> >> and a little more is here
> >>http://www.handshakesolutions.com/Products_Services/ARM996HS/Index.html
> >>
> >> with some plots here:
> >>http://www.handshakesolutions.com/assets/downloadablefile/ARM996HS_leaflet_feb06-13004.pdf
> >>
> >> They don't mention the Vcc of the compared 968E-S, but the Joules and
> >>EMC look compelling, as does the 'self tracking' aspect of Async.
> >>
> >> They also have an Async 8051
> >>http://www.handshakesolutions.com/Products_Services/HT-80C51/Index.html
> >
> >
> > I seem to recall participating in a discussion of asynch processors a
> > while back and came to the conclusion that they had few advantages in
> > the real world. The claim of improved speed is a red herring. The
> > clock cycle of a processor is fixed by the longest delay path which is
> > at lowest voltage and highest temperature. The same is true of the
> > async processor, but the place where you have to deal with the
> > variability is at the system level, not the clock cycle. So at room
> > temperature you may find that the processor runs faster, but under
> > worst case conditions you still have to get X amount of computations
> > done in Y amount of time.
>
> Yes, but systems commonly spend a LOT of time waiting on external, or
> time, events.

Yes, and if power consumption is important the processor can halt or even stop the clock. That is often used when power consumption is critical. That's all the async processor does, it stops its own clock.

BTW, how does the async processor stop to wait for IO? The ARM processor doesn't have a "wait for IO" instruction. So it has to set an interrupt on an IO pin change or register bit change and then stop the CPU, just like the clocked processor. No free lunch here!

> > The two processors will likely be the same
> > speed or the async processor may even be slower. With a clocked
> > processor, you can calculate exactly how fast each path will be and
> > margin is added to the clock cycle to deal with worst case wafer
> > processing. The async processor has a data path and a handshake path
> > with the handshake being designed for a longer delay. This delay delta
> > also has to have margin and likely more than the clocked processor to
> > account for two paths.
>
> Why ? In the clocked case, you have to spec to cover Process spreads,
> and also Vcc and Temp. That's three spreads.
> The Async design self tracks all three, and the margin is there by ratio.

Yes, the async processor will run faster when conditions are good, but what can you do with those extra instruction cycles? You still have to design your application to execute M instructions in N amount of time under WORST CASE conditions. The extra speed is wasted unless, like I said, you want to do some SETI calcs or something that does not need to be done. The async processor just moves the synchronization to the system level where you sit and wait instead of at the gate level at every clock cycle.

> > This may make the async processor slower in the
> > worst case conditions.
> >
> > Since your system timing must work under all cases, you can't really
> > use the extra computations that are available when not running under
> > worst case conditions, unless you can do SETI calculations or something
> > that is not required to get done.
> >
> > I can't say for sure that the async processor does not use less power
> > than a clocked processor, but I can't see why that would be true.
>
> You did look at their Joule plots ?

Yes, but there are too many unknowns to tell if they are comparing apples to oranges. Did the application calculate the Fibonacci series, or do IO with waits? Did the clocked processor use clock gating to disable unused sections or did every section run full tilt at all times? I have no idea how real the comparison is. Considering how the processor works I don't see where there should be a difference. Dig below the surface and consider how many gate outputs are toggling and you will see the only real difference is in the clocking itself; compare the clock tree to the handshake paths.

> > Both
> > are clocked. The async processor is clocked locally and dedicates lots
> > of logic to generating and propagating the clock.
>
> Their gate count comparisons suggest this cost is not a great as one
> would first think.

But the gate count is higher in the async processor.

> > A clocked chip just has to distribute the clock.
>
> ... and that involves massive clock trees, and amps of clock driver
> spikes, in some devices....(not to mention electro migration issues...)

You can wave your hands and cry out "massive clock trees", but you still have to distribute clocks everywhere in the async part, it is just done differently with lots of logic in the clock path and they call it a handshake. Instead of trying to minimize the clock delay, they lengthen it to exceed the logic delay.

> > The rest of the logic is the same between
> > the two.
> >
> > I suppose that the async processor does have an advantage in the area
> > of noise.
>
> yes. [and probably makes some code-cracking much harder...]
>
> > As SOC designs add more and more analog and even RF onto the
> > same die, this will become more important. But if EMI with the outside
> > world is the consideration, there are techniques to spread the spectrum
> > of the clock that reduce the generated EMI. This won't help on-chip
> > because each clock edge generates large transients which upset analog
> > signals.
> >
> > I can't comment on the data provided by the manufacturer. I expect
> > that you can achieve similar results with very agressive clock
> > management.
>
> Perhaps, in the limiting case, yes - but you have two problems:
> a) That is a LOT of NEW system overhead, to manage all that agressive
> clock management...
> b) The Async core does this 'Clock management for free - it is part of
> the design.

It is "free" the same way in any design. The clock management in a clocked part would not be software, it would be in the hardware.

> > I don't recall the name of the company, but I remember
> > recently reading about one that has cut CPU power significantly that
> > way. I think they were building a processor to power a desktop
> > computer and got Pentium 4 processing speeds at just 25 Watts compared
> > to 80+ Watts for the Pentium 4.
>
> Intel are now talking of Multiple/Split Vccs on a die, including
> some mention of magnetic layers, and inductors, but that is horizon
> stuff, not their current volume die.
> I am sure they have an impressive road map, as that is one thing that
> swung Apple... :)

I found the article in Electronic Products, FEB 2006, "High-performance 64-bit processor promises tenfold cut in power", pp24-26. It sounds like a real hot rod with dual 2 GHz processors, dual high speed memory interfaces, octal PCI express, gigabit Ethernet and lots of other stuff. 5 to 13 Watts typical and 25 Watts max. So you can do some amazing stuff with power without going to async clocking.

> > That may not convey well to the
> > embedded world where there is less paralellism. So I am not a convert
> > to async processing as yet.
>
> I'd like to see a more complete data sheet, and some real silicon, but
> the EMC plot of the HT80C51 running indentical code is certainly an eye
> opener. (if it is a true comparison).
>
> It is nice to see (pico) Joules / Opcode quoted, and that is the right
> units to be thinking in.

fpga_toys@yahoo.com wrote:
> rickman wrote:
> > The two processors will likely be the same
> > speed or the async processor may even be slower. With a clocked
> > processor, you can calculate exactly how fast each path will be and
> > margin is added to the clock cycle to deal with worst case wafer
> > processing. The async processor has a data path and a handshake path
> > with the handshake being designed for a longer delay. This delay delta
> > also has to have margin and likely more than the clocked processor to
> > account for two paths. This may make the async processor slower in the
> > worst case conditions.
>
> There are a lot of different async technologies, not all suffer from
> this. Dual rail with an active ack do not rely on the handshake having
> a longer time to envelope the data path worst case. Phased Logic
> designs are one example.

Can you explain? I don't see how you can async clock logic without having a delay path that exceeds the worst path delay in the logic. There is no way to tell when combinatorial logic has settled other than to model the delay.

I found some links with Google, but I didn't gain much enlightenment with the nickel tour. What I did find seems to indicate that the complexity goes way up, since each signal becomes two signals that combine value and timing, called LEDR encoding. I don't see how this is an improvement.

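For readers unfamiliar with LEDR (level-encoded dual-rail), here is a minimal sketch of the encoding as it is usually described: each bit uses a value rail and a timing rail, the phase of a token is the XOR of the two rails, and successive tokens alternate phase, so exactly one rail toggles per new token. The function names are hypothetical, and this shows only the signalling convention, not a gate-level Phased Logic implementation.

```python
def ledr_phase(v, t):
    """Phase of a LEDR wire pair: 0 (even) when the rails agree, 1 (odd) otherwise."""
    return v ^ t

def ledr_next(state, new_value):
    """Send new_value as the next token. The phase must flip on every token,
    so exactly one of the two rails toggles."""
    v, t = state
    new_phase = 1 - ledr_phase(v, t)
    return (new_value, new_value ^ new_phase)

state = (0, 0)  # value 0, even phase
for bit in [1, 1, 0, 1]:
    nxt = ledr_next(state, bit)
    toggled = (state[0] != nxt[0]) + (state[1] != nxt[1])
    print("send", bit, "->", nxt, "phase", ledr_phase(*nxt), "rails toggled:", toggled)
    state = nxt
```

A receiver detects a new token by seeing the phase change, which is how arrival is signalled without a separately delayed strobe; the price is the doubled wire count noted above.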
> > Since your system timing must work under all cases, you can't really
> > use the extra computations that are available when not running under
> > worst case conditions, unless you can do SETI calculations or something
> > that is not required to get done.
>
> Using dual rail with ack, there is no worst case design consideration
> internal to the logic ... it's just functionally correct by design at
> any speed. So, if the chip is running fast, so does the logic, up until
> it must synchronize with the outside world.

That is the point. Why run fast when you can't make use of the extra speed? Your app must be designed for the worst case speed and anything faster is lost.

> > I can't say for sure that the async processor does not use less power
> > than a clocked processor, but I can't see why that would be true. Both
> > are clocked. The async processor is clocked locally and dedicates lots
> > of logic to generating and propagating the clock. A clocked chip just
> > has to distribute the clock. The rest of the logic is the same between
> > the two.
>
> for fine grained async, there is very little cascaded logic, and as
> such very little transitional glitching in comparision to relatively
> deep combinatorials that are clocked. This transitional glitching at
> clocks consumes more power than just the best case behaviorial of clean
> transitions of all signals at clock edges and no prop or routing
> delays.
>
> for course grained async, the advantage obviously goes away.

I think you are talking about a pretty small effect compared to the overall power consumption.

> > I suppose that the async processor does have an advantage in the area
> > of noise. As SOC designs add more and more analog and even RF onto the
> > same die, this will become more important. But if EMI with the outside
> > world is the consideration, there are techniques to spread the spectrum
> > of the clock that reduce the generated EMI. This won't help on-chip
> > because each clock edge generates large transients which upset analog
> > signals.
>
> By design clocked creates a distribution of additive current spikes
> following clock edges, even if spread spectrum. This simply is less, if
> any, of a problem using async designs. Async has a much better chance
> of creating larger DC component in the power demand by time spreading
> transistions so that the on chip capacitance can filter the smaller
> transition spikes, instead of high the AC components with a lot of
> frequency components that you get with clocked designs.
>
> In the whole discussion about the current at the center of the ball
> array and DC currents, this was the point the was missed. If you slow
> the clock down enough, the current will go from zero, to a peak shortly
> after a clock, and back to zero, with any clocked design. To get the
> current profile to maintain a significant DC level for dynamic
> currents, requires carefully balancing multiple clock domains and using
> deeper than one level of LUTs with long routing to time spread the
> clock currents. Very Very regular designs, with short routing and a
> single lut depth, will generate a dynamic current spike 1-3 lut delays
> from the clock transition. On small chips which do not have a huge
> clock net skew, this will mean most of the dynamic current will
> occuring in a two or three lut delay window following clock
> transitions. Larger designs with a high distribution of multiple levels
> of logic and routing delays flatten this distribution out.
>
> Dual rail with ack designs just completely avoid this problem.

Care to explain how Dual rail with ack operates?
rickman wrote:
> Can you explain? I don't see how you can async clock logic without
> having a delay path that exceeds the worst path delay in the logic.
> There is no way to tell when combinatorial logic has settled other than
> to model the delay.

Worst case sync design requires that the clock period be longer than the longest worst case combinatorial path ... ALWAYS ... even when the device is operating under best case conditions. Devices with best case fab operating under best case environmentals are forced to run just as slow as worst case fab devices under worst case environmentals.

The tradeoff with async is to accept that under worst case fab and worst case environmentals, the design will run a little slower because of the ack path.

However, under typical conditions, and certainly under best case fab and best case environmentals, the expectation is that the ack path delay costs are a minor portion of the improvements gained by using the ack path. If the device has very small deviations in performance from best case to worst case, and the ack costs are high, then there clearly isn't any gain to be had. Other devices, however, do offer this gain for certain designs.

Likewise, many designs might be clock constrained by an exception path that is rarely exercised, but the worst case delay for that rare path will constrain the clock rate for the entire design. With async, that problem goes away, as the design can operate with timings for the normal path without worrying about the slowest worst case paths.

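The rare-exception-path argument is an average-case versus worst-case comparison, which can be sketched as below. The path delays, the ack overhead, and the fraction of operations that take the slow path are hypothetical values chosen for illustration only.

```python
def clocked_time(n_ops, slow_ns):
    """Every cycle is as long as the slowest path, whether it is used or not."""
    return n_ops * slow_ns

def async_time(n_ops, fast_ns, slow_ns, p_slow, ack_ns):
    """Each operation takes its own path delay plus an ack overhead."""
    per_op = (1 - p_slow) * fast_ns + p_slow * slow_ns + ack_ns
    return n_ops * per_op

N_OPS = 1000
FAST_NS, SLOW_NS, ACK_NS = 6.0, 10.0, 1.5  # hypothetical delays

print("clocked:            ", clocked_time(N_OPS, SLOW_NS), "ns")
print("async, 5% slow path:", async_time(N_OPS, FAST_NS, SLOW_NS, 0.05, ACK_NS), "ns")
print("async, all slow:    ", async_time(N_OPS, FAST_NS, SLOW_NS, 1.0, ACK_NS), "ns")
```

When the slow path is rare the self-timed total is well under the clocked total; when every operation takes the slow path the ack overhead makes it slower, which matches the trade-off described above.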
> I think you are talking about a pretty small effect compared to the
> overall power consumption.

Depends greatly on the design and logic depth. For your design it might not make a difference, as you suggest. For a multiplier it can be significant, as every transition, including the glitches, costs the same dynamic power.

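To make the glitching claim concrete, here is a minimal sketch that counts output toggles in a 4-bit ripple-carry adder using a unit-delay model, where each full adder is treated as one delay step for both of its outputs. That delay model, the operand values, and all names are simplifying assumptions; a real multiplier or a real timing model would give different counts.

```python
def full_adder(a, b, c):
    """Return (sum, carry_out) for one bit position."""
    return a ^ b ^ c, (a & b) | (c & (a ^ b))

def settle(a_bits, b_bits, sums, carries):
    """Unit-delay simulation: every stage sees the previous step's carries.
    Returns the settled sums, the settled carries and the toggle count."""
    n = len(a_bits)
    toggles = 0
    while True:
        new_s, new_c = list(sums), list(carries)
        for i in range(n):
            cin = carries[i - 1] if i > 0 else 0
            new_s[i], new_c[i] = full_adder(a_bits[i], b_bits[i], cin)
        changed = sum(x != y for x, y in zip(sums + carries, new_s + new_c))
        if changed == 0:
            return sums, carries, toggles
        toggles += changed
        sums, carries = new_s, new_c

def bits(x, n):
    """Little-endian bit list of x."""
    return [(x >> i) & 1 for i in range(n)]

N = 4
# Settle on the old operands (0 + 0) first, then apply the new ones (7 + 1).
s, c, _ = settle(bits(0, N), bits(0, N), [0] * N, [0] * N)
s2, c2, toggles = settle(bits(0b0111, N), bits(0b0001, N), s, c)
needed = sum(x != y for x, y in zip(s + c, s2 + c2))
print("toggles:", toggles, "needed:", needed, "glitches:", toggles - needed)
```

In this model the extra toggles come from sum bits that first respond to the new operands and then flip back when the carry ripples in, and each of those costs the same switching energy as a useful transition.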
rickman wrote:
> BTW, how does the async processor stop to wait for IO? The ARM
> processor doesn't have a "wait for IO" instruction.

Yes, that has to be one of the keys. Done properly, JNB Flag,$ should spin only that opcode's logic, and activate only the small cache doing it.

> So it has to set
> an interrupt on a IO pin change or register bit change and then stop
> the CPU, just like the clocked processor. No free lunch here!

That's the coarse-grain way, the implementation above can drop to tiny power anywhere.

> Yes, the async processor will run faster when conditions are good, but
> what can you do with those extra instruction cycles?

Nothing, the point is you save energy, by finishing earlier.

>>Their gate count comparisons suggest this cost is not a great as one
>>would first think.
>
>
> But the gate count is higher in the async processor.

Not in the 8051 example. In the ARM case, it is 89:88, pretty much even.

The thing to do now, is wait for some real devices, and better data.

  -jg

fpga_toys@yahoo.com wrote:
> rickman wrote:
> > Can you explain? I don't see how you can async clock logic without
> > having a delay path that exceeds the worst path delay in the logic.
> > There is no way to tell when combinatorial logic has settled other than
> > to model the delay.
>
> Worst case sync design requires that the clock period be slower than the
> longest worst case combinatorial path ... ALWAYS ... even when the
> device is operating under best case conditions. Devices with best case
> fab operating under best case environmentals, are forced to run just as
> slow as worst case fab devices under worst case environmental.
>
> The tradeoff with async is to accept that under worst case fab and worst
> case environmental, that the design will run a little slower because of
> the ack path.
>
> However, under typical conditions, and certainly under best case fab and
> best case environmentals, the expecation is that the ack path delay costs
> are a minor portion of the improvements gained by using the ack path. If
> the device has very small deviations in performance from best case to
> worst case, and the ack costs are high, then there clearly isn't any gain
> to be had. Other devices however, do offer this gain for certain designs.
>
> Likewise, many designs might be clock constrained by an exception path
> that is rarely exercised, but the worst case delay for that rare path will
> constrain the clock rate for the entire design. With async, that problem
> goes away, as the design can operate with timings for the normal path
> without worrying about the slowest worst case paths.

You have ignored the real issue. The issue is not whether the async design can run faster under typical conditions; we all know it can. The issue is how do you make use of that faster speed? The system design has to work in the worst case conditions, so you can only use the performance that is available under worst case conditions.

You can do the same thing with a clocked design. Measure the temperature and run the clock faster when the temperature is cooler. It just is not worth the effort since you can't do anything useful with the extra instructions.

> > I think you are talking about a pretty small effect compared to the
> > overall power consumption.
>
> Depends greatly on the design and logic depth. For your design it might
> not make a difference as you suggest. For a multiplier it can be
> significant, as every transistion, including the glitches cost the same
> dynamic power.

The glitching happens in any design. Inputs change and create changes on the gate outputs which feed other gates, etc until you reach the outputs. But the different paths will have different delays and the outputs as well as the signals in the path can jump multiple times before they settle. The micro-glitching you are talking about will likely cause little additional glitching relative to what already happens. Of course, YMMV.