Charles Krug
Guest
Posted: Wed Feb 23, 2005 6:59 pm
Post subject: Re: Simple Hardware Clock question
On Wed, 23 Feb 2005 05:34:41 +0300, Maxim S. Shatskih
<maxim@storagecraft.com> wrote:
> In modern CPUs like the Athlon or Pentium 4, the pipeline can be as
> long as 10 or 20 stages. Therefore even though each instruction takes
> 10 or 20 cycles, they are pipelined so that we can achieve 1
> instr/cycle throughput.
>
> More so.
>
> The weak point of superscalar is that the decision on parallelizing is
> made at run time by the CPU hardware, which cannot keep a large context.
>
> A Very Long Instruction Word (VLIW) CPU like IA-64 shifts this burden to
> the compiler. The compiler (which can keep a huge context) decides how to
> parallelize the operations across several execution units.
>
> The downsides are the huge complexity of the compiler and of the assembly
> language (nearly impossible to write assembly by hand, too much context
> to keep in your head).
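
To put rough numbers on the quoted latency-versus-throughput point: in an idealized pipeline (an assumption: one instruction issued per cycle, no stalls, no branch flushes), N instructions through a D-stage pipeline take about D + N - 1 cycles, so throughput approaches 1 instr/cycle as N grows. A minimal Python sketch, with function names made up for illustration:

```python
def pipelined_cycles(n_instructions: int, depth: int) -> int:
    """Cycles to finish n instructions in an ideal pipeline of `depth` stages.

    Assumes one instruction enters per cycle and nothing ever stalls.
    """
    if n_instructions == 0:
        return 0
    # The first instruction needs `depth` cycles; each later one finishes
    # one cycle after its predecessor.
    return depth + n_instructions - 1


def unpipelined_cycles(n_instructions: int, depth: int) -> int:
    """Cycles if each instruction had to finish before the next one starts."""
    return depth * n_instructions


if __name__ == "__main__":
    n, depth = 1000, 20
    piped = pipelined_cycles(n, depth)
    print(piped, unpipelined_cycles(n, depth))  # 1019 20000
    print(n / piped)  # throughput in instr/cycle, just under 1
```

Real hardware falls short of this because of cache misses, mispredicted branches, and dependent instructions, which is where the superscalar and VLIW discussion comes in.
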
In a former job, I was working with (among other things) TI 'C6x DSPs,
with an open pipeline VLIW architecture.

A TON of effort went into hand-optimizing inner loops that exactly fit
inside the on-chip memory. TI provided an intermediate form of assembly
language that helped quite a bit--you could define your independent
instruction sequences that the optimizer could USUALLY arrange
optimally.

But we had abundant war stories about squeezing the last smidgeon of
performance out of ten instructions.
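
For anyone who has not seen this kind of scheduling, here is a toy sketch of the idea of packing independent operations into wide instruction words. It is not how TI's optimizer or any IA-64 compiler actually works. The function name, the op names, and the unit-latency assumption are all invented for illustration, and a real scheduler also has to respect functional-unit types and multi-cycle latencies.

```python
def pack_bundles(ops, deps, width):
    """Greedily pack ops into bundles of at most `width` ops each.

    ops:   op names in program order
    deps:  dict mapping an op to the set of ops whose results it needs
    An op may enter a bundle only after every op it depends on has been
    placed in an EARLIER bundle (every op is assumed to take one cycle).
    """
    done = set()            # ops placed in earlier bundles
    remaining = list(ops)
    bundles = []
    while remaining:
        bundle = []
        for op in remaining:
            if len(bundle) == width:
                break
            if deps.get(op, set()) <= done:
                bundle.append(op)
        if not bundle:
            raise ValueError("dependency cycle or unknown dependency")
        done.update(bundle)
        remaining = [op for op in remaining if op not in bundle]
        bundles.append(bundle)
    return bundles


if __name__ == "__main__":
    ops = ["load_a", "load_b", "mul", "load_c", "add", "store"]
    deps = {
        "mul": {"load_a", "load_b"},
        "add": {"mul", "load_c"},
        "store": {"add"},
    }
    for bundle in pack_bundles(ops, deps, width=2):
        print(bundle)
```

With a width of 2, the six ops above fit in four bundles. Widening the machine to 4 does not help, because the mul -> add -> store chain is the limit. Much of the hand-tuning effort goes into restructuring loops to break chains like that.
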