| Author |
Message |
John Savard
Guest
|
Posted:
Sat Jan 15, 2005 10:44 pm Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
On Thu, 13 Jan 2005 01:46:34 GMT, Maynard Handley <name99@name99.org>
wrote, in part:
| Quote: | You obviously have never programmed AltiVec, have you, Nick?
While I understand why AltiVec does not allow for unaligned accesses,
and accept that it may well have been and continue to be the correct
tradeoff, the fact is that it is a pain to deal with.
|
AltiVec is a feature similar to MMX.
It works with small vectors which contain several items of a given data
type.
It certainly is true that forcing these vectors to be aligned on a
128-bit boundary will impact many perfectly legitimate programming
operations.
But that doesn't change the fact that it is very seldom necessary to
allow a 64-bit floating-point number to start on an odd 32-bit boundary,
and so on. If one has a compressed record format that includes 32-bit
integer fields starting at odd bytes, one just uses byte instructions to
construct the records.
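As a minimal sketch of the byte-instruction approach just described, the helpers below read and write a 32-bit field at an arbitrary byte offset using only byte accesses. The names `load_u32_be` and `store_u32_be` are hypothetical, and big-endian field order is an assumption made for illustration, not something the post specifies.

```c
#include <stdint.h>
#include <stddef.h>

/* Read a 32-bit big-endian field that may start at any byte offset.
   Only byte loads are used, so the alignment of p never matters. */
static uint32_t load_u32_be(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Store a 32-bit value as a big-endian field at any byte offset,
   again using byte stores only. */
static void store_u32_be(uint8_t *p, uint32_t v)
{
    p[0] = (uint8_t)(v >> 24);
    p[1] = (uint8_t)(v >> 16);
    p[2] = (uint8_t)(v >> 8);
    p[3] = (uint8_t)v;
}
```

A packed record with a 32-bit field at byte offset 1 would then be built with `store_u32_be(rec + 1, value)` and read back with `load_u32_be(rec + 1)`.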
Putting only a few extra gates on a chip to allow unaligned accesses,
and then warning programmers that these accesses will have a performance
penalty, so they should not be used unless really needed, is usually the
best tradeoff, though. It eliminates a potential source of confusion and
error at the lowest cost.
Pipelined arithmetic units support vector operations in which successive
vector elements are processed in overlapped, rather than simultaneous,
fashion. The Cray and its predecessors are examples of this. While
there's nothing wrong with having a parallel vector unit as well, it can
be pipelined too, and vectorized as well: that is, it can be given vector
instructions that act on vectors whose length is a multiple of the
length of the short vectors it operates on as elementary units.
Thus, when the fast wide arithmetic unit won't do, just use a vector
instruction on the slow narrow arithmetic unit. Since they're two
different arithmetic units, they could even be running at the same time,
so that rather than having fewer FLOPS by using the slower arithmetic
unit occasionally, one ends up with more FLOPS!
John Savard
http://home.ecn.ab.ca/~jsavard/index.html |
|
MrTibbs
Guest
|
Posted:
Sun Jan 16, 2005 7:55 am Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
| Quote: | Putting only a few extra gates on a chip to allow unaligned accesses,
|
I'm not so sure it's just a few gates. What if the unaligned access
crosses a cache line boundary, and one line is in the cache and one
isn't? What if it crosses a page boundary, and blah blah...
There's the MOESI/whatever protocol for multiprocessors as well.
Although few programs may do unaligned accesses on shared memory, it
has to work right if it is advertised.
It may or may not be a few gates, but I think the hardware folks, with
unaligned accesses, now have to deal with a whole bunch of corner cases
that they wouldn't have to consider otherwise.
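To make the corner cases concrete, here is a small sketch of the test that decides whether an access straddles a cache line or a page. The function name `crosses_boundary` is hypothetical, and the block sizes used in the note after the code (64-byte lines, 4096-byte pages) are assumed values chosen for illustration, not figures from any particular processor.

```c
#include <stdint.h>
#include <stdbool.h>
#include <stddef.h>

/* True if the byte range [addr, addr + size) touches more than one
   block of block_size bytes. Assumes block_size is a power of two
   and size is nonzero. */
static bool crosses_boundary(uintptr_t addr, size_t size, size_t block_size)
{
    uintptr_t mask  = ~(uintptr_t)(block_size - 1);
    uintptr_t first = addr & mask;              /* block holding the first byte */
    uintptr_t last  = (addr + size - 1) & mask; /* block holding the last byte  */
    return first != last;
}
```

With 64-byte lines, an 8-byte access at address 60 crosses a line while one at address 56 does not. A naturally aligned access can never cross either kind of boundary, which is why it avoids these cases entirely.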
jim |
|
MrTibbs
Guest
|
Posted:
Sun Jan 16, 2005 7:55 am Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
Maynard Handley wrote:
| Quote: | [snip]
My whole point was that the specific nature of these codecs (for
example
the way that H264 breaks the image up into variable sized blocks
which
can be as small as 4x4) means that however you slice and dice the
problem (and you have complete control over the memory structures ---
these are all internal) you're going to spend a lot of your time
wanting
to load vectors that are not aligned to a multiple of 16.
|
True, but with a helpful ISA I would *ALWAYS* prefer to do aligned
loads. First, it's nice to have an aligned load instruction that
tolerates an unaligned address (i.e. it masks the least significant
address bits and does the appropriate aligned load). This saves the
programmer from having to do the address mask operation explicitly.
Then, after you do all the aligned loads you need, all you need is some
kind of merging instruction that takes two registers and extracts the
desired word to the result register. Yes, this takes an extra
instruction, but I'm guessing that in unaligned capable hardware the
cache requires an extra cycle or two to pretty much do the same thing.
I don't know any common ISA that offers both of these features, but I
know a number do one or the other.
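The two features described above can be sketched in scalar C on 32-bit words. This is only a model under stated assumptions: memory is represented as an array of `uint32_t`, byte 0 of a word is taken to be its least significant byte (little-endian numbering), and the names `load_word_masked`, `merge_words` and `load_unaligned` are hypothetical.

```c
#include <stdint.h>
#include <stddef.h>

/* Aligned load that tolerates any byte offset: the low two offset bits
   are masked off, so the word containing the addressed byte is returned. */
static uint32_t load_word_masked(const uint32_t *mem, size_t byte_off)
{
    return mem[(byte_off & ~(size_t)3) / 4];
}

/* Merge step: treat lo and hi as consecutive words and extract the
   32-bit value that starts at byte (byte_off & 3) of lo. */
static uint32_t merge_words(uint32_t lo, uint32_t hi, size_t byte_off)
{
    unsigned shift = (unsigned)(byte_off & 3) * 8;
    if (shift == 0)
        return lo;                 /* avoids an undefined 32-bit shift */
    return (lo >> shift) | (hi << (32 - shift));
}

/* Unaligned 32-bit load built from two aligned loads plus one merge. */
static uint32_t load_unaligned(const uint32_t *mem, size_t byte_off)
{
    uint32_t lo = load_word_masked(mem, byte_off);
    uint32_t hi = load_word_masked(mem, byte_off + 4);
    return merge_words(lo, hi, byte_off);
}
```

Note that the second aligned load is issued even when the offset is already aligned, so the word after the requested one must be readable.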
Jim |
|
Maynard Handley
Guest
|
Posted:
Sun Jan 16, 2005 6:24 pm Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
In article <1105855140.194975.134600@f14g2000cwb.googlegroups.com>,
"MrTibbs" <jim2101024@comcast.net> wrote:
| Quote: | Maynard Handley wrote:
[snip]
My whole point was that the specific nature of these codecs (for
example
the way that H264 breaks the image up into variable sized blocks
which
can be as small as 4x4) means that however you slice and dice the
problem (and you have complete control over the memory structures ---
these are all internal) you're going to spend a lot of your time
wanting
to load vectors that are not aligned to a multiple of 16.
True, but with a helpful ISA I would *ALWAYS* prefer to do aligned
loads. First, it's nice to have an aligned load instruction that
tolerates an unaligned address (i.e. it masks the least significant
address bits and does the appropriate aligned load). This saves the
programmer from having to do the address mask operation explicitly.
Then, after you do all the aligned loads you need, all you need is some
kind of merging instruction that takes two registers and extracts the
desired word to the result register. Yes, this takes an extra
instruction, but I'm guessing that in unaligned capable hardware the
cache requires an extra cycle or two to pretty much do the same thing.
I don't know any common ISA that offers both of these features, but I
know a number do one or the other.
Jim
|
What you have described is exactly what AltiVec DOES.
Yes it works well, no-one is denying that.
BUT
- it is one more pain you have to deal with and
- it is hard to get the compiler to handle this stuff automatically in a
way that doesn't waste a lot of cycles because there is no natural way
to annotate for its benefit what addresses one expects to be aligned and
which one does not. If your CPU is capable of, say, 2G vector ops/sec
and you are content to get 500M, this may not matter much. But if you
are striving to get 1.8G because you have a real performance goal, this
puts a real crimp on how much you can just hand over to the compiler.
Maynard |
|
Anton Ertl
Guest
|
Posted:
Sun Jan 16, 2005 8:51 pm Post subject:
Re: RISC vs. CISC design principles |
|
|
Terje Mathisen <terje.mathisen@hda.hydro.com> writes:
| Quote: | Anton Ertl wrote:
Regarding the latency, if the CPU internally combines the MOV followed
by the two-address instruction into a three-address micro-instruction,
I would expect the same latency in the execution engine.
Such an engine _must_ be capable of detecting the idiom of moving
something away, doing something to the original, and then updating the
copy: This is what good compilers will and should generate in order to
minimize latency on as many cpu implementations as possible.
|
You may wish that, but e.g., Intel seems to favour creating
microarchitectures that run existing code correctly, but perform
better on code optimized for the microarchitecture; you probably can
name examples for that better than I can. If they see that they can
get most of the performance of three-address code with an appropriate
compiler from just combining the "x=y; x+=z" idiom, and if detecting
your idiom would have a significant extra cost, they probably won't do
it.
As an example, the P5 microarchitecture can optimize sequences like
"FADD; FXCH; FADD" into pipelined FP code; this requires very specific
code sequences; this style was not helpful or recommended on the 486,
and not on the P6, either.
As for compilers, they often have options for optimizing for specific
microarchitectures, or to generate more general code. The compiler
will optimize for whatever you ask it to optimize for. Once the new
microarchitecture is on the market for some time, compilers will have
special optimizations for it, and if it is the most popular
microarchitecture, users will usually optimize for that, certainly not
for as many as possible (e.g., people don't optimize PC software for
the 386, 486 or P5 microarchitectures any more).
- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html |
|
David Wang
Guest
|
Posted:
Sun Jan 16, 2005 9:13 pm Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
Christian Bau <christian.bau@cbau.freeserve.co.uk> wrote:
| Quote: | The complaint is that you actually have to do this. If an unaligned load
were available, even if it is a cycle slower, things would still be
quicker, and writing code would be much easier.
|
Some recent discussions on RWT led to the re-discovery of the 2003
Motorola document that reported some SPEC CPU 2000 numbers for the
PPC 7455.
http://www.freescale.com/files/sndf/doc/reports_presentations/SNDF2003_EUROPE_H1107.pdf
In the same slides, pp. 32-33 seem to suggest that non-alignment causes
some really serious problems for the PPC 74xx series of processors, at least
as far as gcc is concerned. The penalty seems to be quite a bit larger than
just one cycle or two. Do you have some thoughts on these data?
--
davewang202(at)yahoo(dot)com |
|
Terje Mathisen
Guest
|
Posted:
Mon Jan 17, 2005 12:45 am Post subject:
Re: RISC vs. CISC design principles |
|
|
Anton Ertl wrote:
| Quote: | Terje Mathisen <terje.mathisen@hda.hydro.com> writes:
Anton Ertl wrote:
Regarding the latency, if the CPU internally combines the MOV followed
by the two-address instruction into a three-address micro-instruction,
I would expect the same latency in the execution engine.
Such an engine _must_ be capable of detecting the idiom of moving
something away, doing something to the original, and then updating the
copy: This is what good compilers will and should generate in order to
minimize latency on as many cpu implementations as possible.
You may wish that, but e.g., Intel seems to favour creating
microarchitectures that run existing code correctly, but perform
better on code optimized for the microarchitecture; you probably can
name examples for that better than I can. If they see that they can
get most of the performance of three-address code with an appropriate
compiler from just combining the "x=y; x+=z" idiom, and if detecting
your idiom would have a significant extra cost, they probably won't do
it.
|
Anton, I was going to write a strong rebuttal, but then I realized that
you're almost certainly correct. :-(
MOV immediately followed by OP to the same target register does seem
like a more reasonable target for a pipeline-optimizing decoder. :-)
| Quote: |
As an example, the P5 microarchitecture can optimize sequences like
"FADD; FXCH; FADD" into pipelined FP code; this requires very specific
code sequences; this style was not helpful or recommended on the 486,
and not on the P6, either.
|
Actually, it didn't penalize either 486 or P6 much; in fact you do need
some way to tell even a PIII or P4 that multiple stack-based fp
operations are independent. The proper solution on a P4 is to just
forget about the x87 fpu and use SSE instead, but if you need to stay
backwards compatible, using a few FXCH opcodes is still a good idea.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
Terje Mathisen
Guest
|
Posted:
Mon Jan 17, 2005 12:55 am Post subject:
Re: RISC vs. CISC design principles |
|
|
Anton Ertl wrote:
| Quote: | You may wish that, but e.g., Intel seems to favour creating
microarchitectures that run existing code correctly, but perform
better on code optimized for the microarchitecture; you probably can
name examples for that better than I can. If they see that they can
get most of the performance of three-address code with an appropriate
compiler from just combining the "x=y; x+=z" idiom, and if detecting
your idiom would have a significant extra cost, they probably won't do
it.
|
I have to write another reply to this:
What is wrong with using the register renaming unit to remove reg-reg
MOVes completely, by treating them like read-only shared pages in an
operating system?
Since the usual approach is to turn all instructions into three-operand
micro-ops that write to a new unique location anyway, it would seem like
such MOV opcodes really don't do anything at all, except updating a
scoreboard to remember which architected register is stored in which
physical reg.
I can see how this could create a few more problems in case of a
trap/interrupt, but otherwise it seems like such an obvious idea that
there must be something wrong with it?
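The idea can be illustrated with a toy rename table in C, where a reg-reg MOV only aliases two architected names to one physical register and a reference count plays the role of the shared read-only page. This is a sketch under heavy assumptions: every name here (`Renamer`, `rename_mov`, `rename_add`, the register counts) is hypothetical, a physical register is freed as soon as no architected name points to it, and in-flight readers, retirement and trap recovery are ignored completely.

```c
#include <stdint.h>

enum { NUM_ARCH = 8, NUM_PHYS = 32 };

typedef struct {
    int     map[NUM_ARCH];   /* architected register -> physical register */
    int     refs[NUM_PHYS];  /* architected names sharing each physical register */
    int64_t value[NUM_PHYS]; /* contents of each physical register */
} Renamer;

static void renamer_init(Renamer *r)
{
    for (int p = 0; p < NUM_PHYS; p++) { r->refs[p] = 0; r->value[p] = 0; }
    for (int a = 0; a < NUM_ARCH; a++) { r->map[a] = a; r->refs[a] = 1; }
}

/* Claim a physical register that no architected name points to; -1 if none. */
static int alloc_phys(Renamer *r)
{
    for (int p = 0; p < NUM_PHYS; p++)
        if (r->refs[p] == 0) { r->refs[p] = 1; return p; }
    return -1;
}

/* MOV dst, src: no data moves; dst becomes another name for the
   physical register that currently holds src. */
static void rename_mov(Renamer *r, int dst, int src)
{
    int p = r->map[src];
    r->refs[p]++;            /* take the new reference first */
    r->refs[r->map[dst]]--;  /* then drop dst's old mapping  */
    r->map[dst] = p;
}

/* ADD dst, src (dst += src): the result is written to a fresh physical
   register, so a register shared with other names is never overwritten. */
static int rename_add(Renamer *r, int dst, int src)
{
    int p = alloc_phys(r);
    if (p < 0)
        return -1;           /* out of physical registers */
    r->value[p] = r->value[r->map[dst]] + r->value[r->map[src]];
    r->refs[r->map[dst]]--;
    r->map[dst] = p;
    return 0;
}
```

Because every result already goes to a new physical register, the write after a MOV needs no special copy step; the sharing simply ends when the destination is remapped.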
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
Bernd Paysan
Guest
|
Posted:
Mon Jan 17, 2005 1:10 am Post subject:
Re: RISC vs. CISC design principles |
|
|
Terje Mathisen wrote:
| Quote: | I can see how this could create a few more problems in case of a
trap/interrupt, but otherwise it seems like such an obvious idea that
there must be something wrong with it?
|
Interrupts are not a problem - you always can defer interrupts to the next
save point where an interruption doesn't cause much harm (checkpoint time,
e.g. branches). Exceptions, as done in the P4, also don't really cause a
problem. The algorithm is simple:
* revert to the last checkpoint
* change the decode/scheduler algorithm to "single, in-order issue"
* single step through until you fail again
* back out the last, failed instruction
* load the trap descriptor, and turn the decode/scheduler algorithm again to
"normal"
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/ |
|
Christian Bau
Guest
|
Posted:
Mon Jan 17, 2005 1:12 am Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
In article <1105855140.194975.134600@f14g2000cwb.googlegroups.com>,
"MrTibbs" <jim2101024@comcast.net> wrote:
| Quote: | True, but with a helpful ISA I would *ALWAYS* prefer to do aligned
loads. First, it's nice to have an aligned load instruction that
tolerates an unaligned address (i.e. it masks the least significant
address bits and does the appropriate aligned load). This saves the
programmer from having to do the address mask operation explicitly.
Then, after you do all the aligned loads you need, all you need is some
kind of merging instruction that takes two registers and extracts the
desired word to the result register. Yes, this takes an extra
instruction, but I'm guessing that in unaligned capable hardware the
cache requires an extra cycle or two to pretty much do the same thing.
I don't know any common ISA that offers both of these features, but I
know a number do one or the other.
|
You described quite precisely what AltiVec does. You can do exactly the
things that you describe. If you don't know that a pointer p is aligned
properly, then to read sixteen bytes starting at p you use the
instructions
reg1 = "generate permutation vector" (p);
reg2 = load_aligned (p);
reg3 = load_aligned (p + 16);
reg4 = vector_permute (reg2, reg3, reg1);
In a loop where you read consecutive 16-byte vectors, the permutation
vector in reg1 is unchanged, so you just write
reg2 = reg3;
reg3 = load_aligned (p+32...);
reg4 = vector_permute (reg2, reg3, reg1);
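The sequence above can be emulated in portable C to show what each step computes. This is a sketch, not AltiVec code: `vec16`, `make_perm` and `copy_unaligned` are hypothetical names, `load_aligned` and `vector_permute` follow the pseudocode, and the buffer `base` is assumed to start on a 16-byte boundary with offsets measured from it. On real hardware the three steps correspond roughly to the `vec_lvsl`, `vec_ld` and `vec_perm` intrinsics.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

typedef struct { uint8_t b[16]; } vec16;  /* stand-in for a 128-bit register */

/* Aligned load: the low four bits of the byte offset are ignored. */
static vec16 load_aligned(const uint8_t *base, size_t off)
{
    vec16 v;
    memcpy(v.b, base + (off & ~(size_t)15), 16);
    return v;
}

/* Permutation vector for a misalignment k = off & 15: k, k+1, ..., k+15. */
static vec16 make_perm(size_t off)
{
    vec16 v;
    for (int i = 0; i < 16; i++)
        v.b[i] = (uint8_t)((off & 15) + (size_t)i);
    return v;
}

/* Select 16 bytes out of the 32-byte concatenation a:b. */
static vec16 vector_permute(vec16 a, vec16 b, vec16 perm)
{
    vec16 r;
    for (int i = 0; i < 16; i++) {
        unsigned idx = perm.b[i] & 31u;
        r.b[i] = idx < 16 ? a.b[idx] : b.b[idx - 16];
    }
    return r;
}

/* Copy nvec unaligned 16-byte vectors starting at base + off into dst,
   issuing one new aligned load per output vector as in the loop above. */
static void copy_unaligned(uint8_t *dst, const uint8_t *base,
                           size_t off, size_t nvec)
{
    vec16 perm = make_perm(off);
    vec16 lo = load_aligned(base, off);
    for (size_t i = 0; i < nvec; i++) {
        vec16 hi = load_aligned(base, off + 16 * (i + 1));
        vec16 out = vector_permute(lo, hi, perm);
        memcpy(dst + 16 * i, out.b, 16);
        lo = hi;  /* reuse the high half as the next low half */
    }
}
```

As with the pseudocode, the aligned block following the last requested byte is always read, so the source buffer has to extend that far.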
The complaint is that you actually have to do this. If an unaligned load
were available, even if it is a cycle slower, things would still be
quicker, and writing code would be much easier. |
|
Niels Jørgen Kruse
Guest
|
Posted:
Mon Jan 17, 2005 1:48 am Post subject:
Re: Unaligned accesses |
|
|
Christian Bau <christian.bau@cbau.freeserve.co.uk> wrote:
| Quote: | The complaint is that you actually have to do this. If an unaligned load
were available, even if it is a cycle slower, things would still be
quicker, and writing code would be much easier.
|
If the unaligned load caused a replay whenever crossing a line boundary,
you wouldn't want it anyway.
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark |
|
Kai Harrekilde-Petersen
Guest
|
Posted:
Mon Jan 17, 2005 3:50 am Post subject:
Re: RISC vs. CISC design principles |
|
|
nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
| Quote: | In article <m3acrez0xi.fsf@averell.firstfloor.org>,
Andi Kleen <freitag@alancoxonachip.com> wrote:
nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
In my career, I have never seen a significant use of it except to
cover up misdesigned interfaces - in particular, ones that have
failed to take the decision whether they are based on semi-abstract
types like integers and floating-point or on precisely specified
bit patterns.
It's useful to process IPv4 packets. On an aligned Ethernet packet the
TCP header ends up being unaligned. Same is true for other protocols.
That is precisely what I am describing as a misdesigned protocol.
|
Are you poking at the 14 byte Ethernet header or the IPv4 header here?
- I thought the IPv4 header was quite well-laid out, with everything
aligned to natural boundaries.
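The arithmetic behind the question can be sketched as follows. The IPv4 field offsets are internally aligned, but a 14-byte untagged Ethernet header in front shifts them by two bytes when the frame itself starts on a 4-byte boundary. The constant and function names are hypothetical, and the two-byte padding is only shown as one way the shift can be cancelled.

```c
#include <stddef.h>

enum {
    ETH_HEADER_LEN  = 14, /* 6 dst MAC + 6 src MAC + 2 EtherType, no VLAN tag */
    IPV4_SRC_OFFSET = 12, /* source address offset within the IPv4 header */
    IPV4_DST_OFFSET = 16  /* destination address offset within the IPv4 header */
};

/* Misalignment (0..3) of a 32-bit IPv4 field, given how many padding
   bytes precede the Ethernet header in a 4-byte-aligned buffer. */
static size_t ipv4_field_misalignment(size_t pad, size_t field_offset)
{
    return (pad + ETH_HEADER_LEN + field_offset) % 4;
}
```

With no padding the source address sits at byte 26 of the frame, which is 2 modulo 4; with two bytes of padding it moves to byte 28 and becomes naturally aligned.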
Regards,
Kai
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk> |
|
John Savard
Guest
|
Posted:
Mon Jan 17, 2005 5:25 am Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
On 15 Jan 2005 22:05:14 -0800, "MrTibbs" <jim2101024@comcast.net> wrote,
in part:
| Quote: | Putting only a few extra gates on a chip to allow unaligned accesses,
I'm not so sure it's just a few gates. What if the unaligned access
crosses a cache line boundary, and one line is in the cache and one
isn't? What if it crosses a page boundary, and blah blah...
|
You make it into a few gates by turning an unaligned access into
multiple accesses of smaller things. If you want a smaller performance
penalty, *then* it's more gates.
John Savard
http://home.ecn.ab.ca/~jsavard/index.html |
|
Terje Mathisen
Guest
|
Posted:
Mon Jan 17, 2005 7:55 am Post subject:
Re: RISC vs. CISC design principles |
|
|
Bernd Paysan wrote:
| Quote: | Terje Mathisen wrote:
I can see how this could create a few more problems in case of a
trap/interrupt, but otherwise it seems like such an obvious idea that
there must be something wrong with it?
Interrupts are not a problem - you always can defer interrupts to the next
save point where an interruption doesn't cause much harm (checkpoint time,
e.g. branches). Exceptions, as done in the P4, also don't really cause a
problem. The algorithm is simple:
* revert to the last checkpoint
* change the decode/scheduler algorithm to "single, in-order issue"
* single step through until you fail again
* back out the last, failed instruction
* load the trap descriptor, and turn the decode/scheduler algorithm again to
"normal"
|
Right.
Could it be a problem that reg-reg moves happen so often, and
often/usually overlap in the lifetime of the results?
This could easily reduce the number of 'natural' checkpoints (i.e.
places where every architected register is stored in a separate rename
register) to zero. :-(
To avoid this problem, it does seem to me like you would need some
special mechanism to clean up overlapping register allocations. The
easiest might be to always keep the architected register slots free, and
copy them all in case of a trap. Another approach is to run the renamer
as today, including when doing MOVs, but maintain an extra alias
(pending MOV) table to remember the suppressed operations.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
Emil Naepflein
Guest
|
Posted:
Mon Jan 17, 2005 7:55 am Post subject:
Re: Unaligned accesses |
|
|
Niels Jørgen Kruse wrote:
| Quote: | Christian Bau <christian.bau@cbau.freeserve.co.uk> wrote:
The complaint is that you actually have to do this. If an unaligned load
were available, even if it is a cycle slower, things would still be
quicker, and writing code would be much easier.
If the unaligned load caused a replay whenever crossing a line boundary,
you wouldn't want it anyway.
|
Yes, especially cache coherency may cause a lot of headaches.
Emil
--
Philosys Software GmbH     Phone: +49 89/321407-40
Edisonstrasse 6            Fax: +49 89/321407-12
85716 Unterschleissheim    EMail: egn@philosys.de
Germany                    WWW: www.philosys.de
System Software is our Speciality |
|