Guest
|
Posted:
Fri Jul 15, 2005 8:15 am Post subject:
Re: Code density and performance? |
|
|
Dennis M. O'Connor wrote:
[snip]
| Quote: | Around the time RISC was developing, memory was getting
much faster, relative to processor logic, and cheaper than it
had previously been. I contend this was one reason RISC could
successfully (for a while) abandon the densely-coded ISA's of
the CISCs for the faster-to-decode simplicity that characterized
the original RISC ISAs.
|
Another reason being the ability to implement a simple processor
on a single chip, especially with an on-chip Icache?
Interestingly memory is still cheap but is now slow (or is it
expensive and a little slow, considering on-chip L3 cache
comparable to early RISC memory?)
| Quote: | But times change, and logic got faster again. Code density mattered
again, and RISC architectures started incorporating denser
but harder to decode instructions (like ARM's Thumb and Thumb2
ISA extensions).
|
I take this as a vote that code density matters enough (even for
a high-performance ISA) that significant thought should be
devoted to compact representations of work. (I am not certain I
agree with the argument. Even for embedded systems in the
higher-performance part of the spectrum, the dense-code ISAs have
not yet taken root, much less flourished. I am guessing that code
density is more important in some embedded systems because
moderate-capacity persistent storage [flash] is relatively
expensive and the ratio of code memory to data memory is
relatively high [and one often pays twice for memory capacity,
once for persistent store and again for DRAM, correct?]. In more
price-constrained systems, cache [being redundant, low-density
memory] is probably also less desirable relative to more
performance-oriented systems.)
[snip]
| Quote: | that this will lead to a different "optimal" ISA for processors, an ISA that
minimizes the need for wasted communication. However, market forces
will probably cause that ISA to be hidden underneath translation hardware.
|
Software translation, PLEASE! :-)
Paul A. Clayton
just a technophile
http://home.earthlink.net/~paaronclayton |
|
Nick Maclaren
Guest
|
Posted:
Fri Jul 15, 2005 8:15 am Post subject:
Re: Code density and performance? |
|
|
In article <42D71813.30608@Comcast.net>,
John Ahlstrom <AhlstromJK@Comcast.net> wrote:
| Quote: |
Designing ISAs for ease of code generation was first done in the late
50s and delivered in the early 60s on the B5000 et seq (perhaps
not "mainstream" systems). The 5000/5500/6700.../A-Series word mode
instruction sets were designed for easy code generation from Algol. Somewhat
later than the B5000 (delivery in mid 60s) was the Cobol-oriented
ISA of the B3500/2500 et seq. Perhaps these too were not "mainstream"
systems.
|
Er, yes. I should have made myself clearer :-( I was thinking of
somewhat more generic models, but I realise that I made a typo and
put 'language' instead of 'languages'. Anyway, I wasn't meaning
to put down the Burroughs, so much as to point out that there
was serious published analysis on ISA design over 35 years back,
covering many of the same topics that were rediscovered in the
mid-1980s and are being rediscovered again today.
In response to a remark of Dennis O'Connor, yes, the cheap memory
of the late 1980s was a factor in RISC getting away with it, but
the code size issue had ceased to be a primary constraint over a
decade before. It was, briefly, a rediscovered issue on the early
microprocessors and was important until recently on embedded ones,
but is almost always overstated.
My belief is that the enduring, if not endearing, infatuation with
code size is because it is one of the few issues simple enough for
inexperienced amateurs to grasp.
Regards,
Nick Maclaren. |
|
Dennis M. O'Connor
Guest
|
Posted:
Fri Jul 15, 2005 8:15 am Post subject:
Re: Code density and performance? |
|
|
iain-3@truecircuits.com wrote:
| Quote: | Dennis> And I'm going to claim that for CPUs running full-feature
Dennis> OS's, that graph is worthless
Well, yeah. Can you give me a pointer to some public data which
is better?
|
Not offhand. Surely you can find some yourself, though.
I'm sure the IEEE and ACM journals have papers on such.
| Quote: | In any case, although you certainly wouldn't want to rely on the
specific shape of the curve, other results that I've read have
indicated that cache miss rate continues to fall as cache sizes
get bigger -- compulsory misses almost never dominate.
|
Misses caused by task switching aren't mainly compulsory misses.
| Quote: | Dennis> At some point, your bigger cache actually leads to lower
Dennis> performance, because the better hit rate just doesn't
Dennis> make up for the slowdown.
But note that that point is greater than 2MB for secondary caches
for a low-power x86 implementation (named Banias). Or did I get
the number wrong?
|
Why do you say that ? Do you have numbers to back that claim,
comparing, say, the performance of a CPU with a 2MB L2 against
that of a CPU with a faster 1MB cache ? I'm not saying you are
wrong, mind you. But you are making a claim without any apparent
support beyond "it must be better because that's what they did".
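For what it is worth, the trade-off being argued here can be put into the usual average-access-time arithmetic. This is only a sketch: the hit times, miss rates and miss penalty below are invented for illustration and are not measurements of Banias or of any real part.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average access time in cycles: hit time plus the expected miss cost."""
    return hit_time + miss_rate * miss_penalty


def break_even_miss_rate_drop(extra_hit_time, miss_penalty):
    """Miss-rate reduction a slower cache needs just to match the faster one."""
    return extra_hit_time / miss_penalty


# Hypothetical L2 configurations; every number here is an assumption.
small_fast = amat(hit_time=8, miss_rate=0.06, miss_penalty=200)  # "1MB"
big_slow = amat(hit_time=12, miss_rate=0.05, miss_penalty=200)   # "2MB"

print(f"1MB: {small_fast:.1f} cycles, 2MB: {big_slow:.1f} cycles")
print(f"break-even miss-rate drop: {break_even_miss_rate_drop(4, 200):.3f}")
```

With these made-up figures the larger cache loses, because it gains one percentage point of hit rate where it would need two to pay for its four extra cycles. Different figures flip the result, which is why measured data is needed to settle the question.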
| Quote: | Iain> The gain here is that the rename logic can be integrated into
Iain> the fetch pipeline, with no FIFO between the I$ and rename.
Iain> That FIFO adds fetch latency and thus mispredict latency.
Dennis> I don't see how that gain comes from knowing how many
Dennis> register reads and writes you are doing. What am I missing ?
If the fetch pipeline integrates rename with no intervening FIFO,
then for X bytes fetched, you will have Y read renames and Z write
renames. If you can predict, when launching the fetch into the pipe,
how many renames are needed, then you can throttle the pipe to match
your rename bandwidth at the top of the pipe.
|
Ah, you are assuming a microarchitecture that depends heavily
on register renaming. Not all of them do, you know, and that
may not be the "sweet spot" in the future.
| Quote: | Iain> Compressing the instruction stream is good.
Dennis> Now MEM op MEM is such a pain, I'm all for not hitting the
Dennis> data TLB twice. But some kind of MEM op REG, that just
Dennis> might be the new sweet spot ...
Yep. x86 looks pretty good if you use the right subset. And of
course MEM <- MEM op REG is just fine when it's the same MEM: one
TLB access.
|
But two data cache accesses, even if one can be done opportunistically
via a write buffer. I'm not sure that's worth it. REG <- MEM op REG
seems easier to pipeline, to me. |
|
Guest
|
Posted:
Fri Jul 15, 2005 8:15 am Post subject:
Re: Code density and performance? |
|
|
Dennis> And I'm going to claim that for CPUs running full-feature
Dennis> OS's, that graph is worthless
Well, yeah. Can you give me a pointer to some public data which
is better?
In any case, although you certainly wouldn't want to rely on the
specific shape of the curve, other results that I've read have
indicated that cache miss rate continues to fall as cache sizes
get bigger -- compulsory misses almost never dominate.
And yep, many folks I've worked with would agree that big
associative TLBs are better than simple sims indicate.
Dennis> At some point, your bigger cache actually leads to lower
Dennis> performance, because the better hit rate just doesn't
Dennis> make up for the slowdown.
But note that that point is greater than 2MB for secondary caches
for a low-power x86 implementation (named Banias). Or did I get
the number wrong?
Iain> The gain here is that the rename logic can be integrated into
Iain> the fetch pipeline, with no FIFO between the I$ and rename.
Iain> That FIFO adds fetch latency and thus mispredict latency.
Dennis> I don't see how that gain comes from knowing how many
Dennis> register reads and writes you are doing. What am I missing ?
If the fetch pipeline integrates rename with no intervening FIFO,
then for X bytes fetched, you will have Y read renames and Z write
renames. If you can predict, when launching the fetch into the pipe,
how many renames are needed, then you can throttle the pipe to match
your rename bandwidth at the top of the pipe.
Otherwise, with no FIFO, you'll occasionally fetch an instruction
which needs more rename ports than are available. The pipeline gets
the wrong answer and you restart it. I suppose you can tell the
compiler writers to put lotsa-register instructions near ops with
constants, to cut down how often this happens. I've never analyzed
that alternative.
And, of course, on some RISC pipes the prediction is trivial: each
32-bit instruction fetch will need two reads and one write. No
throttling necessary.
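A toy model of that throttling, as a sketch only: it assumes the per-instruction read and write counts are already known (or predicted) at fetch time, and the function name and port counts are hypothetical, not taken from any real pipeline.

```python
def group_for_rename(insts, read_ports, write_ports):
    """Greedily pack instructions into per-cycle groups that fit the rename ports.

    insts is a list of (reads, writes) register counts, one pair per
    instruction, in program order.
    """
    groups, current, reads, writes = [], [], 0, 0
    for r, w in insts:
        if r > read_ports or w > write_ports:
            raise ValueError("single instruction exceeds rename bandwidth")
        # Start a new cycle when this instruction would oversubscribe a port.
        if reads + r > read_ports or writes + w > write_ports:
            groups.append(current)
            current, reads, writes = [], 0, 0
        current.append((r, w))
        reads += r
        writes += w
    if current:
        groups.append(current)
    return groups


# The trivial RISC case: every instruction is two reads and one write.
risc = group_for_rename([(2, 1)] * 4, read_ports=4, write_ports=2)
# A mixed stream, where the group size varies from cycle to cycle.
mixed = group_for_rename([(2, 1), (3, 1), (0, 1), (1, 1)], 4, 2)
print(len(risc), len(mixed))
```

In the uniform case every group has the same size, so nothing needs predicting. In the mixed case the group boundaries depend on the instructions themselves, and that is the information the fetch stage would have to predict.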
Iain> Compressing the instruction stream is good.
Dennis> Now MEM op MEM is such a pain, I'm all for not hitting the
Dennis> data TLB twice. But some kind of MEM op REG, that just
Dennis> might be the new sweet spot ...
Yep. x86 looks pretty good if you use the right subset. And of
course MEM <- MEM op REG is just fine when it's the same MEM: one
TLB access. |
|
Guest
|
Posted:
Fri Jul 15, 2005 8:15 am Post subject:
Re: Architecture in an era of high communication cost was Re |
|
|
Stephen Fuld wrote:
[snip]
| Quote: | Yes, thanks. That makes a lot of sense. So, if I may ask, or
at least try to start a "general interest thread" (note subject
change) what are the characteristics of a thread that is
designed for an era where communication cost is expensive (in
the sense Dennis talks about above). What are good things and
what are bad? How will this change underlying architecture and,
if one is willing to abandon legacy ISAs (say for an embedded
processor), how will it change ISAs?
|
I would guess that a new ISA would have more explicit
(software-managed) partitioning of work and more explicit
management of communication. (Offloading more of this to
software would allow simpler, smaller hardware and a more
efficient communication hierarchy/network.) This implies more
threading, architected clustering of registers and functional
units, possibly more explicit control of the memory hierarchy.
It might also be desirable for basic units of logical work to
become larger (and presumably this would be somewhat exposed in
the ISA to reduce hardware complexity/size) given 'cheap' logic.
In addition to horizontal clustering (perhaps somewhat like
MAJC?), vertical clustering (along the length of the pipeline)
might be exposed in the ISA. (E.g., one might have 'front-end'
registers [branch condition registers, jump target registers],
'early-stage' registers [global address registers and possibly a
stack register, registers whose values change relatively
infrequently and which are used to access associated
'early-stage' cache; this might include jump table lookups
{reducing the length of communication of jump targets}],
'middle-stage' registers [address and offset registers {with the
main fast cache}, perhaps flag values and even shift values],
'fast-execution-stage' registers [where result latency is
important] and perhaps a 'slow-execution-stage'.)
There might also be some attraction to explicit small value
registers (reducing register file size and communication width),
obviously for condition registers but also for other values.
Since intra-core multithreading is likely to be attractive (to
hide communication latency) there might be an incentive to
provide for register-based communication between threads (this
can be helpful with threads in separate cores as well). There
would probably also be some incentive for non-privileged thread
control operations (in which the OS would only get involved when
a hardware resource limit was reached?).
One might also want an ISA that has a nice hierarchy of decoding
such that farther levels of the memory hierarchy might have
denser but less decode-friendly representations. Decompression
work can be less expensive in energy and latency than going to
the next level of the hierarchy. OTOH, one also does not want
the common case (L1 hit) to be excessively slow and/or
power-using since decode logic would tend to be faster and less
energetic than accessing a large memory array. Farther levels of
memory could work on larger blocks and so recognize a different
level of redundancy/compressibility.
Good things would seem to be those that hide latency
(multithreading [especially intra-core], e.g.), make the common
case fast/small (clustering and hierarchies of communication,
e.g.), and exploit pipelining (perhaps with the motto "do [even
partial] things as soon as possible"?). Bad things would seem to
be those that unnecessarily expose latency (e.g., a load
jump-pointer and jump instruction might be good for code density
but bad for not allowing the load to be scheduled earlier
[assuming prediction is problematic]), that excessively
generalize (e.g., a huge L1 cache to increase the hit rate but
which increases average power and average access time), or that
try to do everything at one time.
Paul A. Clayton
just a technophile, not even a CS student
http://home.earthlink.net/~paaronclayton |
|
Wilco Dijkstra
Guest
|
Posted:
Fri Jul 15, 2005 1:48 pm Post subject:
Re: Code density and performance? |
|
|
<jon@beniston.com> wrote in message
news:1121353952.721161.61300@g49g2000cwa.googlegroups.com...
| Quote: | MIPS-16 / Thumb aren't the only way to do it. There are architectures
that support intermixed 16/32-bit instructions. These don't have the
same performance penalty.
|
Yes, Thumb-2 does mix 16 and 32-bit instructions and can achieve
the same performance as ARM, but at Thumb-1 codesize.
| Quote: | A decrease in code size would most likely lead to an increase in
instruction cache hit rate. This obviously depends upon the
implementation of other microarchitectural features such as prefetch
though.
|
This effect can be quite large on a small cache (8 or 16KB), and is still
significant on 32 and 64KB I-caches. On a context switch you get the
working set back with far fewer cache misses/page faults. You can spend
extra transistors on the cache (ARM needs a 50% larger I-cache for identical
performance) or on the memory hierarchy (L2, wider bus - more area and
extra power) or on an instruction decoder that can deal with 16/32-bit
instructions (cheap compared to the other alternatives).
In the embedded world going for code density is definitely the best tradeoff.
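A back-of-the-envelope sketch of the capacity argument follows. It assumes, hypothetically, that the mixed 16/32-bit stream needs the same number of instructions as the fixed 32-bit one, and the two-thirds share of 16-bit instructions is chosen only because it reproduces the 50% figure; neither is measured data.

```python
from fractions import Fraction


def avg_inst_bytes(frac_16bit):
    """Average instruction size when a fraction of instructions are 16-bit."""
    return 2 * frac_16bit + 4 * (1 - frac_16bit)


def insts_in_cache(cache_bytes, frac_16bit):
    """How many instructions of the given mix fit in the I-cache."""
    return cache_bytes / avg_inst_bytes(frac_16bit)


cache = 16 * 1024
fixed32 = insts_in_cache(cache, Fraction(0))     # all 32-bit instructions
mixed = insts_in_cache(cache, Fraction(2, 3))    # assumed 2/3 are 16-bit
print(fixed32, mixed, mixed / fixed32)
```

Under those assumptions the same 16KB array holds half again as many instructions, which is one way to read the 50% larger I-cache needed for the fixed-width encoding.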
Wilco |
|
Torben Ægidius Mogensen
Guest
|
Posted:
Fri Jul 15, 2005 1:51 pm Post subject:
Re: Code density and performance? |
|
|
"Dennis M. O'Connor" <dmoc@primenet.com> writes:
| Quote: | "Torben Ægidius Mogensen" <torbenm@app-4.diku.dk> wrote in ...
Dysthymicdolt@aol.com writes:
Torben Ægidius Mogensen wrote:
- Provide instructions to load/store an interval/set of registers on
stack to reduce size of procedure prologues/epilogues.
Optimizing procedure overhead is an attractive idea,
but I am concerned that adding such would increase hardware
complexity and pipeline length.
It has been a part of the ARM ISA since ARM1, so it can't be that
difficult to implement.
That's very poor reasoning indeed. Some ISA features are easy
in primitive systems that lack caches, out-of-order completion,
split transaction buses and MMUs, but become a pain in the ass
to get right when these things are introduced. For example,
in a non-MMU system, you don't have to worry about a page fault
occurring in the middle of your LDM/STM, which is a PITA.
|
True. I believe early ARMs had an error exactly when this happened.
| Quote: | LDM/STM is not too hard, but there are a lot of things you
have to keep in mind to make sure it gets done in the face
of all the events that might occur during its execution, and
you do need to add a little hardware to the pipeline to do it.
One of the motivational philosophies of RISC was to avoid
things that added complexity without adding significant
performance. Given the high hit rate of instruction caches,
you can argue that LDM/STM may not be a win compared to
a sequence of, say, LDRD/STRD (double-word load/stores),
especially if your ISA is efficiently encoded (like IA32 and Thumb).
And then you go off and simulate that, for the particular
implementation technology of the day, and see.
|
The main issue here was code density, not efficiency, so replacing
LDM/STM by a sequence of load/stores is not a good idea.
| Quote: | On a related note, one mistake naive architects often make
is thinking that the "goodness" of a particular ISA feature
is independent of the underlying implementation technology.
This is demonstrably untrue.
|
Definitely. But code density less so than efficiency, though some
choices for making code more dense can adversely affect efficiency.
Torben |
|
Torben Ægidius Mogensen
Guest
|
Posted:
Fri Jul 15, 2005 2:05 pm Post subject:
Re: Code density and performance? |
|
|
"Stephen Fuld" <s.fuld@PleaseRemove.att.net> writes:
| Quote: | <Dysthymicdolt@aol.com> wrote in message
news:1121349929.901360.151270@g44g2000cwa.googlegroups.com...
Torben Ægidius Mogensen wrote:
[snip]
- Use two-address instructions, i.e., A := A+B instead of C := A+B.
Coalescing register allocators can in most cases eliminate the
extra moves required by two-address code (assuming you have more
than a handful of available registers).
The cited paper proposed this also, though it used a 3-operand
encoding when appropriate to avoid the extra move instruction.
One variant that I have thought of that might make sense here is to spend a
single bit in the instruction that when set makes the instruction A+1 = A +
B. That is, if set, the destination is the "next" register from the first
source. This complicates register allocation but saves the extra moves
without using 4-5 bits for a full register specifier.
|
A more drastic way of saving bits for register specification is to
allow only a small subset of the N^2 or N^3 possible pairs and triples
of registers, selected more or less at random. If the desired pair or
triple isn't available, a sequence of moves may be required to move
the operands into the right registers first (and it may get longer if
the registers on the shortest paths are occupied). The register
allocator will need to minimize the number of such moves.
The problem is obviously NP-complete, but heuristics may be found.
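To put rough numbers on the encodings discussed in this subthread, here is a small sketch. The register count of 32 and the table size of 256 allowed pairs are assumptions picked for illustration; nothing above fixes either value.

```python
def index_bits(count):
    """Bits needed to select one of `count` alternatives."""
    return (count - 1).bit_length()


def specifier_bits(num_regs, operands):
    """Bits for `operands` independent, full register specifiers."""
    return operands * index_bits(num_regs)


NUM_REGS = 32  # assumed register file size

three_address = specifier_bits(NUM_REGS, 3)         # C := A + B
two_address = specifier_bits(NUM_REGS, 2)           # A := A + B
next_reg_flag = two_address + 1                     # A+1 := A + B when bit set
restricted_pairs = index_bits(256)                  # index into an assumed
                                                    # table of 256 legal pairs
print(three_address, two_address, next_reg_flag, restricted_pairs)
```

The restricted-pair table is the cheapest to encode, but the sketch says nothing about the cost of the extra moves it forces, which is the hard part for the register allocator.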
Torben |
|
Dennis M. O'Connor
Guest
|
Posted:
Fri Jul 15, 2005 2:15 pm Post subject:
Re: Code density and performance? |
|
|
Torben Ægidius Mogensen wrote:
| Quote: | "Dennis M. O'Connor" <dmoc@primenet.com> writes:
One of the motivational philosophies of RISC was to avoid
things that added complexity without adding significant
performance. Given the high hit rate of instruction caches,
you can argue that LDM/STM may not be a win compared to
a sequence of, say, LDRD/STRD (double-word load/stores),
especially if your ISA is efficiently encoded (like IA32 and Thumb).
And then you go off and simulate that, for the particular
implementation technology of the day, and see.
The main issue here was code density, not efficiency, so replacing
LDM/STM by a sequence of load/stores is not a good idea.
|
That depends. I've seen the distribution of the number
of registers saved and restored by compiler-generated
LDM/STM for ARM, and it is illuminating data. Given the
extra bits needed to spec which regs to load or store
in a LDM/STM, one can imagine an ISA where an equivalent
series of LD or ST ops would not take many more bits
to encode. Given that you could then remove all the decode
and pipeline cruft associated with LDM/STM, this might
be a win even in the face of the slightly higher I-Cache
miss rate. I'd have to build a simulator and run some
workloads to know for sure though, and I don't do that
kind of stuff for a living, although I used to.
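A sketch of the kind of bit counting involved, under loud assumptions: the ARM LDM/STM encoding is a 32-bit instruction carrying a 16-bit register list, but the 16-bit double-word load/store and the histogram of registers saved per call are both invented here, standing in for the imagined ISA and for the real compiler data.

```python
def ldm_bits(n_regs, inst_bits=32):
    """One LDM/STM covers any number of registers in a single instruction."""
    return inst_bits


def sequence_bits(n_regs, regs_per_op=2, op_bits=16):
    """Bits for a run of hypothetical 16-bit ops moving two registers each."""
    ops = -(-n_regs // regs_per_op)  # ceiling division
    return ops * op_bits


def expected_bits(histogram, cost):
    """Average encoding cost over a {registers_saved: occurrences} histogram."""
    total = sum(histogram.values())
    return sum(count * cost(n) for n, count in histogram.items()) / total


# Invented distribution: most prologues save only a few registers.
saved = {2: 50, 4: 30, 8: 20}
print(expected_bits(saved, ldm_bits), expected_bits(saved, sequence_bits))
```

With this made-up distribution the sequence comes out slightly smaller on average; a distribution weighted toward large register sets would favour LDM/STM instead, so the real histogram decides it.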
| Quote: | On a related note, one mistake naive architects often make
is thinking that the "goodness" of a particular ISA feature
is independent of the underlying implementation technology.
This is demonstrably untrue.
Definitely. But code density less so than efficiency, though some
choices for making code more dense can adversely affect efficiency.
|
"Efficiency" can mean efficiency in use of gates, or time,
or power, or engineer-years to get it out the door. :-)
Soon, I think, only the latter two will really matter
in CPUs designed for full-featured OS's.
--
Dennis M. O'Connor
Opinions do not reflect ... |
|
Christoph Breitkopf
Guest
|
Posted:
Fri Jul 15, 2005 2:46 pm Post subject:
Re: Code density and performance? |
|
|
Hello Nick,
nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
| Quote: | In response to a remark of Dennis O'Connor, yes, the cheap memory
of the late 1980s was a factor in RISC getting away with it, but
the code size issue had ceased to be a primary constraint over a
decade before. It was, briefly, a rediscovered issue on the early
microprocessors and was important until recently on embedded ones,
but is almost always overstated.
|
"Memory System Characterization of Commercial Workloads"
(http://research.compaq.com/wrl/projects/Database/isca98_1.pdf)
states a rather bad instruction cache behavior for an Oracle
OLTP application. Since then, software has grown, and, thanks
to object-orientation, code that once was one big chunk is now
spread into methods all over the code space. Still, I am not
knowledgeable enough about caches to judge if this is really
a factor in overall performance.
Regards,
Chris |
|
Guest
|
Posted:
Fri Jul 15, 2005 4:15 pm Post subject:
Re: Architecture in an era of high communication cost was Re |
|
|
Stephen Fuld wrote:
| Quote: | "Dennis M. O'Connor" <dmoc@primenet.com> wrote in message
news:1121374006.375191@nnrp1.phx1.gblx.net...
"Stephen Fuld" <s.fuld@PleaseRemove.att.net> wrote in ...
This is interesting. What "level" of implementation technology are you
talking about? At the level of say "CMOS" (as opposed to bipolar or
GaAs), that seems unlikely to change very much.
Around the time RISC was developing, memory was getting
much faster, relative to processor logic, and cheaper than it
had previously been. I contend this was one reason RISC could
successfully (for a while) abandon the densely-coded ISA's of
the CISCs for the faster-to-decode simplicity that characterized
the original RISC ISAs.
But times change, and logic got faster again. Code density mattered
again, and RISC architectures started incorporating denser
but harder to decode instructions (like ARM's Thumb and Thumb2
ISA extensions).
Now we are seeing a change within the processor logic itself:
the ratio of logic speed to communication speed is changing,
as is the ratio of power dissipated by wire loads as opposed to gates.
6 years ago, on-chip communication delays were small compared
to logic delays, and the power dissipated by wide busses going
across a chip could be neglected. Not so now. Communication
is now expensive in both time and power, the two things you
shouldn't be wasting in a modern CPU. This is what will eventually
kill wide superscalar designs: communication overhead. And I contend
that this will lead to a different "optimal" ISA for processors, an ISA that
minimizes the need for wasted communication. However, market forces
will probably cause that ISA to be hidden underneath translation hardware.
Anyway, does that clarify my meaning ?
Yes, thanks. That makes a lot of sense. So, if I may ask, or at least try
to start a "general interest thread" (note subject change) what are the
characteristics of a thread that is designed for an era where communication
cost is expensive (in the sense Dennis talks about above). What are good
things and what are bad? How will this change underlying architecture and,
if one is willing to abandon legacy ISAs (say for an embedded processor),
how will it change ISAs?
|
To make the topic even more general, I'm going to try to go 'back to
basics'.
Let us say that one has a very nice microprocessor that fits on a given
die - such as the Pentium IV or a comparable chip.
Then, the capability comes along to make bigger chips. What will you
add, so as to provide improved performance?
There are a number of nice things you could add.
You could add a second processor core.
You could add more cache.
You could still improve the arithmetic unit; instead of using the
multiplier unit several times in a row to do a divide, you could have
several multipliers lined up in a division pipeline, and start one
floating-point division every clock cycle.
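One common way to build a divide out of multiplies is Newton-Raphson iteration on the reciprocal, sketched below. Treating each iteration as one stage of the lined-up multipliers is an assumption about what such a pipeline might look like, and the sketch expects a divisor already normalized into [0.5, 1), as a floating-point mantissa would be.

```python
def reciprocal(d, steps=4):
    """Approximate 1/d for d in [0.5, 1) using only multiplies and subtracts."""
    # Standard linear first guess, 48/17 - (32/17) * d, written as constants.
    x = 2.823529411764706 - 1.8823529411764706 * d
    for _ in range(steps):
        # Each step roughly squares the relative error; in a pipelined
        # divider every step would get its own pair of multipliers.
        x = x * (2.0 - d * x)
    return x


def divide(n, d, steps=4):
    """n / d as one final multiply by the refined reciprocal."""
    return n * reciprocal(d, steps)


print(divide(1.0, 0.75), divide(3.0, 0.5))
```

Four steps are enough for double precision from this starting guess, so a fully pipelined version would need roughly eight or nine multipliers to accept a new division every cycle.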
You could widen the bus connecting main memory to the cache.
For today's chips, I would tend to vote for more cache ahead of a
second processor core, because effective utilization of a single
processor core already requires multithreading, and if multiple threads
are executing, the effective size of the cache is divided between them.
What I would *really* like to do, of course, is just make everything
faster: i.e., by using ECL instead of CMOS, or GaAs instead of silicon.
But at present, that isn't remotely possible - while technologies exist
to make faster gates than those in CMOS microprocessors, the number of
gates per chip falls off so drastically that improvement does not
appear possible in this direction.
Since there were some MIPS chips done in ECL, however, it _is_ possible
to make a microprocessor with hardware floating point in ECL; and as we
are getting into the range where the gains from using more transistors
on a chip are ever more marginal, if ECL is not totally abandoned, if
that technology continues to improve, it might once again become worth
considering. Improvements in chip size with GaAs or some other high
electron mobility semiconductor are also taking place - but very
gradually.
Possibly some other technology, like plastic semiconductors, or
amorphous silicon, will improve things in the other direction, allowing
unlimited transistor budgets but for transistors not quite as fast as
those on silicon CMOS chips.
And then there's the other radical way to increase transistor budgets:
replace lithography with something else, so that the 154 nanometer
barrier can be shattered - the difficulty of extreme ultraviolet
lithography being so great that extraordinary techniques are being used
to get as much mileage as possible out of current ultraviolet
wavelengths.
Perhaps, with current technologies improving as they are, a direction
of improvement might be to put a relatively simple microprocessor on
the same chip as some dynamic RAM instead of just cache - and perhaps
even dense non-volatile RAM so that the *hard drive* is in effect on
the chip too.
John Savard |
|
Terje Mathisen
Guest
|
Posted:
Fri Jul 15, 2005 6:45 pm Post subject:
Re: Architecture in an era of high communication cost was R |
|
|
Dennis M. O'Connor wrote:
| Quote: | But then, I like the lawyers I've worked with at Intel. Given the choice
between a random Intel patent lawyer and a random Intel manager,
I'd pick the lawyer. ]
|
Wow!!!
That is either the best reference I've ever seen to a group of lawyers,
or the worst reference to a group of managers. :-)
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
Stephen Fuld
Guest
|
Posted:
Fri Jul 15, 2005 9:36 pm Post subject:
Re: Architecture in an era of high communication cost was R |
|
|
"Dennis M. O'Connor" <dmoc@primenet.com> wrote in message
news:1121399622.907039@nnrp1.phx1.gblx.net...
| Quote: | "Stephen Fuld" <s.fuld@PleaseRemove.att.net> wrote ...
"Dennis M. O'Connor" <dmoc@primenet.com> wrote ...
[...]
Now we are seeing a change within the processor logic itself:
the ratio of logic speed to communication speed is changing,
as is the ratio of power dissipated by wire loads as opposed to gates.
6 years ago, on-chip communication delays were small compared
to logic delays, and the power dissipated by wide busses going
across a chip could be neglected. Not so now. Communication
is now expensive in both time and power, the two things you
shouldn't be wasting in a modern CPU. This is what will eventually
kill wide superscalar designs: communication overhead. And I contend
that this will lead to a different "optimal" ISA for processors, an ISA
that minimizes the need for wasted communication. However, market
forces will probably cause that ISA to be hidden underneath translation
hardware.
I don't know if anyone else at my employer agrees, though.
I'm not connected to processor design there anymore.
So, if you can say, what kinds of things are you doing there now?
Sorry, I can't say.
|
No problem. It was worth asking.
| Quote: |
Anyway, does that clarify my meaning ?
Yes, thanks. That makes a lot of sense. So, if I may ask, or at least
try to start a "general interest thread" (note subject change) what are
the characteristics of a thread that is designed for an era where
communication
By "thread" do you mean ISA ? Or microarchitecture ?
|
Sorry. I was thinking about changing the thread subject and simply used the
word thread again when I should have used microarchitecture or, as I stated
below, ISA.
I understand that "proprietary issues" may prevent you from saying much, but
others may be more free to pontificate.
--
- Stephen Fuld
e-mail address disguised to prevent spam |
|
Stephen Fuld
Guest
|
Posted:
Fri Jul 15, 2005 9:36 pm Post subject:
Re: Architecture in an era of high communication cost was Re |
|
|
<jsavard@ecn.ab.ca> wrote in message
news:1121431116.645085.68990@g14g2000cwa.googlegroups.com...
| Quote: | Stephen Fuld wrote:
"Dennis M. O'Connor" <dmoc@primenet.com> wrote in message
news:1121374006.375191@nnrp1.phx1.gblx.net...
"Stephen Fuld" <s.fuld@PleaseRemove.att.net> wrote in ...
This is interesting. What "level" of implementation technology are you
talking about? At the level of say "CMOS" (as opposed to bipolar or
GaAs), that seems unlikely to change very much.
Around the time RISC was developing, memory was getting
much faster, relative to processor logic, and cheaper than it
had previously been. I contend this was one reason RISC could
successfully (for a while) abandon the densely-coded ISA's of
the CISCs for the faster-to-decode simplicity that characterized
the original RISC ISAs.
But times change, and logic got faster again. Code density mattered
again, and RISC architectures started incorporating denser
but harder to decode instructions (like ARM's Thumb and Thumb2
ISA extensions).
Now we are seeing a change within the processor logic itself:
the ratio of logic speed to communication speed is changing,
as is the ratio of power dissipated by wire loads as opposed to gates.
6 years ago, on-chip communication delays were small compared
to logic delays, and the power dissipated by wide busses going
across a chip could be neglected. Not so now. Communication
is now expensive in both time and power, the two things you
shouldn't be wasting in a modern CPU. This is what will eventually
kill wide superscalar designs: communication overhead. And I contend
that this will lead to a different "optimal" ISA for processors, an ISA that
minimizes the need for wasted communication. However, market forces
will probably cause that ISA to be hidden underneath translation hardware.
Anyway, does that clarify my meaning ?
Yes, thanks. That makes a lot of sense. So, if I may ask, or at least try
to start a "general interest thread" (note subject change) what are the
characteristics of a thread that is designed for an era where communication
cost is expensive (in the sense Dennis talks about above). What are good
things and what are bad? How will this change underlying architecture and,
if one is willing to abandon legacy ISAs (say for an embedded processor),
how will it change ISAs?
To make the topic even more general, I'm going to try to go 'back to
basics'.
Let us say that one has a very nice microprocessor that fits on a given
die - such as the Pentium IV or a comparable chip.
Then, the capability comes along to make bigger chips. What will you
add, so as to provide improved performance?
|
I think Dennis' point is that "bigger" chips won't provide more performance,
as communication costs dominate over transistor/computing costs. So simply
adding more functions, or even making basic transistors faster won't help
much if you can't reduce the distances that signals have to travel.
--
- Stephen Fuld
e-mail address disguised to prevent spam |
|
Stephen Fuld
Guest
|
Posted:
Fri Jul 15, 2005 9:46 pm Post subject:
Re: Code density and performance? |
|
|
<Dysthymicdolt@aol.com> wrote in message
news:1121377777.804266.117240@g43g2000cwa.googlegroups.com...
| Quote: | Stephen Fuld wrote:
[snip]
One variant that I have thought of that might make sense here is to
spend a single bit in the instruction that when set makes the instruction
A+1 = A + B. That is, if set, the destination is the "next" register from
the first source. This complicates register allocation but saves the extra
moves without using 4-5 bits for a full register specifier.
Rather than use incrementing, I think one would want to use
bit replacement (which is logically simpler).
|
I'm not sure what you mean here. If you use a bit set to indicate use a
different register for the destination, then you could only use a subset,
say the even numbered registers for the basic specification and then
substitute the "next" odd numbered one for the results (i.e., substituting
that bit for the low order bit of the register specifier for the first
source). While this could certainly work, it halves the number of register
pairs you could use for the purpose.
Note that while the increment does take longer than "bit substitution", the
result of the addition is the destination register, which isn't needed
until after the operation of the instruction, so it isn't as bad as
delaying getting a source register specification.
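The two ways of deriving the destination can be compared in a few lines. This is a sketch of my reading of the two proposals: the 32-register file and the wrap-around from the last register to register 0 are assumptions, and the function names are made up.

```python
def dest_by_increment(src, num_regs=32):
    """Destination is the register after the first source (wrapping assumed)."""
    return (src + 1) % num_regs


def dest_by_bit_substitution(src):
    """Destination is the first source with its low-order bit forced to 1."""
    return src | 1


def usable_sources(dest_fn, num_regs=32):
    """Sources for which the flag yields a destination different from the source."""
    return [s for s in range(num_regs) if dest_fn(s) != s]


print(len(usable_sources(dest_by_increment)),
      len(usable_sources(dest_by_bit_substitution)))
```

Bit substitution needs no adder, but only even-numbered sources get a distinct destination, which is the halving of usable register pairs described above.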
--
- Stephen Fuld
e-mail address disguised to prevent spam |
|