| Author |
Message |
Guest
|
Posted:
Wed Jul 13, 2005 4:15 pm Post subject:
Code density and performance? |
|
|
For a new ISA intended to be suitable for markets from
higher-performance embedded systems to server/workstation-class
systems (implying superscalar implementations and a 64-bit
architecture among other things), would it be worthwhile seeking
code density greater than a relatively 'pure' RISC like MIPS or
Alpha? Since all implementations would have an on-chip L2 cache,
it seems likely that code density would have only a moderate
impact on performance for common codes. OTOH, all else being
equal (which it would not) expending somewhat significant effort
for even a modest _architected_ performance gain seems worthwhile
in that the benefit would apply to all implementations.
Heidi Pan's "High Performance, Variable-Length Instruction
Encodings" implies that a 25% reduction in code size relative to
MIPS should be possible without making a superscalar
implementation excessively difficult. A more constrained
instruction encoding (with simpler decoding) might lose half of
that advantage, so the question might be: is a 10-15% reduction in
code size worth a signficant effort in architecture design and a
moderate additional effort in implementations? (Even before
reading that paper I had been thinking about a 64-bit instruction
word ISA that would encode more than two instructions on average
by exploiting unused bits in common RISC encodings with modestly
more variability in encodings but fixed locations for register
identifiers and parts of opcodes. The ISA might also have
somewhat more complex instructions than a pure RISC--e.g.,
load/store register pair--and highly implicit encodings for some
operations--e.g., a very short form for subroutine return with
the link register being implicit. Unlike a traditional LIW
architecture, it would be possible to jump into the middle of a
word; the entire instruction word would still be fetched but the
first operation would be converted to a no-op.)
In addition to decode complexity constraints, there is also the
question of whether such places an excessive burden on compiler
developers (at least in terms of generating dense code).
Unfortunately the standard methods for achieving greater code
density tend to increase hardware complexity, compiler
complexity, or both. (A smaller register count can make
CSE-optimizations more difficult to evaluate as well as generally
making optimized register allocation more complex. Implicit
arguments make instruction decoding more complex. Special
purpose registers [even a division as general as address and
integer] makes register allocation more complex. Modal
extensions of opcodes adds complexity to both hardware and
software. Variable length encodings increase hardware
complexity. Complex instructions add both hardware and software
complexity.)
(Since a new ISA in this target market has minimal chance of
being implemented [especially an ISA developed by a mere
technophile], this is a highly theoretical question.)
Paul A. Clayton
Not speaking for Big Sam's Anatomical Warehouse
http://home.earthlink.net/~paaronclayton/Big_Sams |
|
| Back to top |
|
 |
jon@beniston.com
Guest
|
Posted:
Thu Jul 14, 2005 2:26 pm Post subject:
Re: Code density and performance? |
|
|
| Quote: | Heidi Pan's "High Performance, Variable-Length Instruction
Encodings" implies that a 25% reduction in code size relative to
MIPS should be possible without making a superscalar
implementation excessively difficult.
|
You can get a 40% reduction with just a mix of 16 and 32-bit
instructions, without harming performance or making superscalar
execution more difficult than it otherwise would be. And yes, for
embedded apps, 40% reduction in code size is well worth it, that's why
we have MIPS-16, Thumb, ARCcompact etc.
Cheers,
Jon |
|
| Back to top |
|
 |
Nick Maclaren
Guest
|
Posted:
Thu Jul 14, 2005 2:35 pm Post subject:
Re: Code density and performance? |
|
|
In article <1121333201.729594.240150@g47g2000cwa.googlegroups.com>,
"jon@beniston.com" <jon@beniston.com> writes:
|> > Heidi Pan's "High Performance, Variable-Length Instruction
|> > Encodings" implies that a 25% reduction in code size relative to
|> > MIPS should be possible without making a superscalar
|> > implementation excessively difficult.
|>
|> You can get a 40% reduction with just a mix of 16 and 32-bit
|> instructions, without harming performance or making superscalar
|> execution more difficult than it otherwise would be. And yes, for
|> embedded apps, 40% reduction in code size is well worth it, that's why
|> we have MIPS-16, Thumb, ARCcompact etc.
In the past, I have estimated that you could get a 50-75% reduction
by a two-level ISA, where the top level was designed for the high
level language. Designing ISAs for ease of code generation was
first proposed in the late 1960s, as far as I know, but was never
done in a mainstream system, and hasn't been attempted at all in
recent decades.
In order to do this, you need a microcode approach, possibly with
a programmable microcode. In the heyday of the RISC dogma, this
was stated to be incompatible with performance, but the Pentium 4
and Opteron have shown that that is now false, if it ever were
true (which is very doubtful).
Regards,
Nick Maclaren. |
|
| Back to top |
|
 |
Torben Ęgidius Mogensen
Guest
|
Posted:
Thu Jul 14, 2005 3:13 pm Post subject:
Re: Code density and performance? |
|
|
nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
| Quote: | In article <1121333201.729594.240150@g47g2000cwa.googlegroups.com>,
"jon@beniston.com" <jon@beniston.com> writes:
|> > Heidi Pan's "High Performance, Variable-Length Instruction
|> > Encodings" implies that a 25% reduction in code size relative to
|> > MIPS should be possible without making a superscalar
|> > implementation excessively difficult.
|
|> You can get a 40% reduction with just a mix of 16 and 32-bit
|> instructions, without harming performance or making superscalar
|> execution more difficult than it otherwise would be. And yes, for
|> embedded apps, 40% reduction in code size is well worth it, that's why
|> we have MIPS-16, Thumb, ARCcompact etc.
In the past, I have estimated that you could get a 50-75% reduction
by a two-level ISA, where the top level was designed for the high
level language. Designing ISAs for ease of code generation was
first proposed in the late 1960s, as far as I know, but was never
done in a mainstream system,
|
You can argue that Digital's VAX did this.
| Quote: | and hasn't been attempted at all in recent decades.
|
Mainly because, at this level, code generation is fairly trivial.
| Quote: | In order to do this, you need a microcode approach, possibly with
a programmable microcode. In the heyday of the RISC dogma, this
was stated to be incompatible with performance, but the Pentium 4
and Opteron have shown that that is now false, if it ever were
true (which is very doubtful).
|
70's minicomputers often used interpreted microcode, which is bad for
performance. What you see in P4 etc. is compiled microcode where each
instruction is decoded by a combinatoric circuit and translated into
microinstructions all in hardware. This is quite different from
microcode that would decode instructions in software (or "firmware",
as it was often called).
In any case, there are many simple ways to increase code density
without seriously complicating implementation or impacting
performance:
- Use two-address instructions, i.e., A := A+B instead of C := A+B.
Coalescing register allocators can in most cases eliminate the
extra moves required by two-address code (assuming you have more
than a handful of available registers).
- Use mixed-length instructions. If you make sure no instruction
spans a word boundary, this isn't too difficult to handle. This
can be handled, e.g., by using only a few different ways to pack
instruction into a word, for example, either one 31 bit
instruction, two 15-bit instructions or three 10-bit instructions
into one 32-bit word (using the first one or two bits to select the
format).
- Provide instructions to load/store an interval/set of registers on
stack to reduce size of procedure prologues/epilogues.
Torben |
|
| Back to top |
|
 |
Nick Maclaren
Guest
|
Posted:
Thu Jul 14, 2005 3:47 pm Post subject:
Re: Code density and performance? |
|
|
In article <7zhdexr8z2.fsf@app-4.diku.dk>,
torbenm@app-4.diku.dk (=?iso-8859-1?q?Torben_=C6gidius_Mogensen?=) writes:
|> >
|> > In the past, I have estimated that you could get a 50-75% reduction
|> > by a two-level ISA, where the top level was designed for the high
|> > level language. Designing ISAs for ease of code generation was
|> > first proposed in the late 1960s, as far as I know, but was never
|> > done in a mainstream system,
|>
|> You can argue that Digital's VAX did this.
Only feebly. It was a very bad design if that was an objective,
for reasons that were well-known before 1975.
|> > and hasn't been attempted at all in recent decades.
|>
|> Mainly because, at this level, code generation is fairly trivial.
Yes. At the cost of larger code, much more difficulty debugging,
and making it much harder to write correct interrupt handling.
|> 70's minicomputers often used interpreted microcode, which is bad for
|> performance. What you see in P4 etc. is compiled microcode where each
|> instruction is decoded by a combinatoric circuit and translated into
|> microinstructions all in hardware. This is quite different from
|> microcode that would decode instructions in software (or "firmware",
|> as it was often called).
Hmm. What makes that different from the argument that interpreted
and compiled programming languages are entirely different? That
viewpoint is generally agreed to be misleading.
|> In any case, there are many simple ways to increase code density
|> without seriously complicating implementation or impacting
|> performance: ...
The techniques that I am thinking of could actually SIMPLIFY
implementation! They would do that by removing much of the need
for the baroque tweaks (often involving interrupt handling) to
deal with the cases that are almost, but not entirely, handled in
hardware.
Regards,
Nick Maclaren. |
|
| Back to top |
|
 |
Guest
|
Posted:
Thu Jul 14, 2005 4:15 pm Post subject:
Re: Code density and performance? |
|
|
Dysthymicdolt@aol.com wrote:
| Quote: | Heidi Pan's "High Performance, Variable-Length Instruction
Encodings" implies that a 25% reduction in code size relative to
MIPS should be possible without making a superscalar
implementation excessively difficult.
|
That paper, I discovered in a brief Google search, can be found from:
http://www.cag.lcs.mit.edu/scale/hat/
The Google hit for the paper itself doesn't appear to be quite
workable, but this page has a good link on it in an obvious place.
| Quote: | (Since a new ISA in this target market has minimal chance of
being implemented [especially an ISA developed by a mere
technophile], this is a highly theoretical question.)
|
For something that truly has no chance of being implemented, and is not
the product of serious research, try my page at
http://www.quadibloc.com/arch/arcint.htm
I started out with a few pages describing a simple architecture
inspired by the 68000 and the System/360, which was oriented towards a
compact instruction encoding from the beginning. But to add ways of
making the instruction format more complex, I wound up adding a
plethora of alternate addressing modes, struggling to squeeze more in.
(Also, I added just about every unusual feature ever known to the world
of computing, so as to have an opportunity to explain how each one
works...)
John Savard |
|
| Back to top |
|
 |
Guest
|
Posted:
Thu Jul 14, 2005 4:15 pm Post subject:
Re: Code density and performance? |
|
|
jon@beniston.com wrote:
| Quote: | Heidi Pan's "High Performance, Variable-Length Instruction
Encodings" implies that a 25% reduction in code size relative to
MIPS should be possible without making a superscalar
implementation excessively difficult.
You can get a 40% reduction with just a mix of 16 and 32-bit
instructions, without harming performance or making superscalar
execution more difficult than it otherwise would be. And yes, for
embedded apps, 40% reduction in code size is well worth it, that's why
we have MIPS-16, Thumb, ARCcompact etc.
|
IIRC, according to the paper, MIPS-16 suffers on the order
of a 20% performance penalty (presumably mainly from the extra
instructions caused by the smaller immediately available
register set, though the slight lengthening of the pipeline
might have a modest performance impact as well). Any variable
length encoding adds complexity to fetch/decode, especially
for a superscalar implementation. (It might be debated that
any performance-oriented implementation would already be so
complex that some extra decode complexity might be accepted.)
Note also that the target market did not include the lower-end
embedded market where Thumb et al. are so attractive. The
question remains: Would a modest improvement in code density
improve performance enough to justify the extra effort in ISA
design (and whatever [hopefully minimized] additional effort
in implementation)?
Paul A. Clayton
not speaking for Big Sam's Anatomical Warehouse
http://home.earthlink.net/~paaronclayton/Big_Sams |
|
| Back to top |
|
 |
Tom Linden
Guest
|
Posted:
Thu Jul 14, 2005 4:15 pm Post subject:
Re: Code density and performance? |
|
|
On 14 Jul 2005 12:13:37 +0200, Torben Ęgidius Mogensen
<torbenm@app-4.diku.dk> wrote:
| Quote: | nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
In article <1121333201.729594.240150@g47g2000cwa.googlegroups.com>,
"jon@beniston.com" <jon@beniston.com> writes:
|> > Heidi Pan's "High Performance, Variable-Length Instruction
|> > Encodings" implies that a 25% reduction in code size relative to
|> > MIPS should be possible without making a superscalar
|> > implementation excessively difficult.
|
|> You can get a 40% reduction with just a mix of 16 and 32-bit
|> instructions, without harming performance or making superscalar
|> execution more difficult than it otherwise would be. And yes, for
|> embedded apps, 40% reduction in code size is well worth it, that's
why
|> we have MIPS-16, Thumb, ARCcompact etc.
In the past, I have estimated that you could get a 50-75% reduction
by a two-level ISA, where the top level was designed for the high
level language. Designing ISAs for ease of code generation was
first proposed in the late 1960s, as far as I know, but was never
done in a mainstream system,
You can argue that Digital's VAX did this.
|
FWIW, our PL/I on Alpha generates executables ~ 3 times larger than it
does on VAX, which implies the need for larger cache and more bandwidth
on the memory cache channel.
| Quote: |
and hasn't been attempted at all in recent decades.
Mainly because, at this level, code generation is fairly trivial.
In order to do this, you need a microcode approach, possibly with
a programmable microcode. In the heyday of the RISC dogma, this
was stated to be incompatible with performance, but the Pentium 4
and Opteron have shown that that is now false, if it ever were
true (which is very doubtful).
70's minicomputers often used interpreted microcode, which is bad for
performance. What you see in P4 etc. is compiled microcode where each
instruction is decoded by a combinatoric circuit and translated into
microinstructions all in hardware. This is quite different from
microcode that would decode instructions in software (or "firmware",
as it was often called).
In any case, there are many simple ways to increase code density
without seriously complicating implementation or impacting
performance:
- Use two-address instructions, i.e., A := A+B instead of C := A+B.
Coalescing register allocators can in most cases eliminate the
extra moves required by two-address code (assuming you have more
than a handful of available registers).
- Use mixed-length instructions. If you make sure no instruction
spans a word boundary, this isn't too difficult to handle. This
can be handled, e.g., by using only a few different ways to pack
instruction into a word, for example, either one 31 bit
instruction, two 15-bit instructions or three 10-bit instructions
into one 32-bit word (using the first one or two bits to select the
format).
- Provide instructions to load/store an interval/set of registers on
stack to reduce size of procedure prologues/epilogues.
Torben
|
|
|
| Back to top |
|
 |
Guest
|
Posted:
Thu Jul 14, 2005 4:15 pm Post subject:
Re: Code density and performance? |
|
|
Torben Ęgidius Mogensen wrote:
[snip]
| Quote: | In any case, there are many simple ways to increase code density
without seriously complicating implementation or impacting
performance:
- Use two-address instructions, i.e., A := A+B instead of C := A+B.
Coalescing register allocators can in most cases eliminate the
extra moves required by two-address code (assuming you have more
than a handful of available registers).
|
The cited paper proposed this also, though it used a 3-operand
encoding when appropriate to avoid the extra move instruction.
| Quote: | - Use mixed-length instructions. If you make sure no instruction
spans a word boundary, this isn't too difficult to handle. This
can be handled, e.g., by using only a few different ways to pack
instruction into a word, for example, either one 31 bit
instruction, two 15-bit instructions or three 10-bit instructions
into one 32-bit word (using the first one or two bits to select the
format).
|
This is something like the 64-bit instruction word ISA I was
thinking about except that I was thinking of making the
instruction boundaries less clear. E.g., a position in the word
could be reserved for a source register ID where only later
decoding would indicate which operation used that register.
Since the ISA would be targeted to performance-oriented
implementations, having a 64-bit minimum fetch width is
reasonable and such allows for more flexibility in encoding
instructions (including long immediates [I was thinking of
providing a word-spanning operation to provide true 64-bit
immediates primarily to load constant addresses. This
might also be extended to allow the fetch bandwidth to
provide a data stream with a tiny side buffer providing a
loop body of code, though such is probably not worth the
significant complexity that it would add.]).
| Quote: | - Provide instructions to load/store an interval/set of registers on
stack to reduce size of procedure prologues/epilogues.
|
Optimizing procedure overhead is an attractive idea,
but I am concerned that adding such would increase hardware
complexity and pipeline length. Load/store register pair
seemed a weak compromise that seemingly would be simpler
to implement and could take advantage of cache structure.
Thanks for the comments.
Paul A. Clayton
not speaking for Big Sam's Anatomical Warehouse
http://home.earthlink.net/~paaronclayton/Big_Sams |
|
| Back to top |
|
 |
Guest
|
Posted:
Thu Jul 14, 2005 4:15 pm Post subject:
Re: Code density and performance? |
|
|
I wrote:
In reference to Heidi Pan's invention, I have made modifications to the
pages
http://www.quadibloc.com/arch/ar507.htm
http://www.quadibloc.com/arch/ar502.htm
in order to add her idea (with modifications for application to a
pre-existing CISC architecture, rather than a customized
variable-length modification of a pre-existing RISC architecture) to my
architecture.
John Savard |
|
| Back to top |
|
 |
Torben Ęgidius Mogensen
Guest
|
Posted:
Thu Jul 14, 2005 4:15 pm Post subject:
Re: Code density and performance? |
|
|
Dysthymicdolt@aol.com writes:
| Quote: | Torben Ęgidius Mogensen wrote:
- Provide instructions to load/store an interval/set of registers on
stack to reduce size of procedure prologues/epilogues.
Optimizing procedure overhead is an attractive idea,
but I am concerned that adding such would increase hardware
complexity and pipeline length.
|
It has been a part of the ARM ISA since ARM1, so it can't be that
difficult to implement. Nor does it have to affect pipeline length,
though it may require several cycles to complete. In addition to
working well with cache, it also exploits sequential memory access.
Torben |
|
| Back to top |
|
 |
jon@beniston.com
Guest
|
Posted:
Thu Jul 14, 2005 4:15 pm Post subject:
Re: Code density and performance? |
|
|
MIPS-16 / Thumb aren't the only way to do it. There are architectures
that support intermixed 16/32-bit instructions. These don't have the
same performance penalty.
A decrease in code size would most likely lead to an increase in
instruction cache hit rate. This obviously depends upon the
implementation of other microarchitectural features such as prefetch
though.
Cheers,
Jon |
|
| Back to top |
|
 |
Guest
|
Posted:
Thu Jul 14, 2005 9:27 pm Post subject:
Re: Code density and performance? |
|
|
Torben Ęgidius Mogensen wrote:
| Quote: | Dysthymicdolt@aol.com writes:
[snip]
Optimizing procedure overhead is an attractive idea,
but I am concerned that adding such would increase hardware
complexity and pipeline length.
It has been a part of the ARM ISA since ARM1, so it can't be that
difficult to implement. Nor does it have to affect pipeline length,
though it may require several cycles to complete. In addition to
working well with cache, it also exploits sequential memory access.
|
But ARM has also been very slow to go superscalar relative
to MIPS, PowerPC, and Alpha. (There are other reasons for
this, but I suspect that the ISA makes a superscalar
implementation more challenging.) Cracking and millicoding
(to use POWER4 terminology) requires extra work which might
not be easily done in parallel with other, necessary
decoding work. (32b PowerPC also has load/store multiple,
which loads/stores Rn through R31 using any GPR for the base
address register.)
Paul A. Clayton
http://home.earthlink.net/~paaronclayton |
|
| Back to top |
|
 |
Dennis M. O'Connor
Guest
|
Posted:
Thu Jul 14, 2005 9:32 pm Post subject:
Re: Code density and performance? |
|
|
""Torben Ęgidius Mogensen"" <torbenm@app-4.diku.dk> wrote in ...
| Quote: | Dysthymicdolt@aol.com writes:
Torben Ęgidius Mogensen wrote:
- Provide instructions to load/store an interval/set of registers on
stack to reduce size of procedure prologues/epilogues.
Optimizing procedure overhead is an attractive idea,
but I am concerned that adding such would increase hardware
complexity and pipeline length.
It has been a part of the ARM ISA since ARM1, so it can't be that
difficult to implement.
|
That's very poor reasoning indeed. Some ISA features are easy
in primitive systems that lack caches, out-of-order completion,
split transaction buses and MMUs, but become a pain in the ass
to get right when these things are introduced. For example,
in a non-MMU system, you don't have to worry about a page fault
occurring in the middle of your LDM/STM, which is a PITA.
LDM/STM is not too hard, but there are a lot of things you
have to keep in mind to make sure it gets done in the face
of all the events that might occur during its execution, and
you do need to add a little hardware to the pipeline to do it.
One of the motivational philosophies of RISC was to avoid
things that added complexity without adding significant
performance. Given the high hit rate of instruction caches,
you can argue that LDM/STM may not be a win compared to
a sequence of, say, LDRD/STRD (double-word load/stores),
especially if your ISA is efficiently encoded (like IA32 and Thumb).
And then you go off and simulate that, for the particular
implementation technology of the say, and see.
On a related note, one mistake naive architects often make
is thinking that the "goodness" of a particular ISA feature
is independent of the underlying implementation technology.
This is demonstrably untrue.
--
Dennis M. O'Connor dmoc@primenet.com |
|
| Back to top |
|
 |
Guest
|
Posted:
Thu Jul 14, 2005 10:00 pm Post subject:
Re: Code density and performance? |
|
|
Tom Linden wrote:
[snip]
| Quote: | FWIW, our PL/I on Alpha generates executables ~ 3 times larger than it
does on VAX, which implies the need for larger cache and more bandwidth
on the memory cache channel.
|
Was this before the addition of byte and dual-byte load/store
instructions? Also, ISTR that _optimized_ Alpha code
included a lot of padding no-ops (the later implementations
liked to have branch targets 4-instruction aligned, IIRC,
and there might have been some other reasons for adding no-ops
such as to avoid having too many branches in a fetch block).
Other code-expanding optimizations like loop unrolling might
have been less effective on a VAX (because of a smaller register
count, less ability to schedule code, or even just greater cost
of larger code) and so not considered appropriate for VAX by the
compiler writers.
I suspect that Alpha executables were rarely optimized for
size (and that size-optimization was rarely a performance
benefit [relative to the loss from dropping padding no-ops
et al.] given a wide memory interface and large off-chip cache).
Also the weak support for large-branch-count codes in the EV6
branch predictor might imply that cache-busting commercial codes
were not a principal concern to the Alpha designers. OTOH,
perhaps later cache optimizers more carefully balanced code
size and ideal code scheduling?
Paul A. Clayton
http://home.earthlink.net/~paaronclayton/ |
|
| Back to top |
|
 |
|
|
|
|