| Author |
Message |
Niels Jørgen Kruse
Guest
|
Posted:
Mon Jan 17, 2005 1:42 pm Post subject:
Re: Unaligned accesses |
|
|
David Wang <foo@bar.invalid> wrote:
| Quote: | On the same slides, pp 32~33 seem to suggest that non-alignment causes
some really serious problems for PPC 74xx series of processors, at least
as far as gcc is concerned. It seems to be quite a bit larger than just
one cycle or two. Do you have some thoughts on these data?
|
It seems to be a memcpy implementation that drops back to byte at a time
on unaligned buffers. Look at the Bytes/Loop row.
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark |
|
| Back to top |
|
 |
Nick Maclaren
Guest
|
Posted:
Mon Jan 17, 2005 2:19 pm Post subject:
Re: Unaligned accesses |
|
|
In article <naomu0p6eh4bvafkht1auvfj6m0p7hth4v@4ax.com>,
Emil Naepflein <Emil.Naepflein@philosys.de> wrote:
| Quote: | Niels Jørgen Kruse wrote:
Christian Bau <christian.bau@cbau.freeserve.co.uk> wrote:
The complaint is that you actually have to do this. If an unaligned load
were available, even if it is a cycle slower, things would still be
quicker, and writing code would be much easier.
If the unaligned load caused a replay whenever crossing a lineboundary,
you wouldn't want it anyway.
Yes, especially cache coherency may cause a lot of headache.
|
It gets worse - think TLBs on SMP systems. There have been reports of
some systems where there were not enough TLB entries to handle the
worst possible cases, which therefore could never complete! Whether
or not that is so, think of the following nightmare scenario:
On an SMP system, a communication word (say one that is written from
one thread and read from the others) is unaligned. Now work out
what happens with the TLB misses and how much this will increase
the chances of the two halves of the word being inconsistent when
they are read.
The alternative is, of course, that the system has to support
multi-TLB lockdown, of some sort, which is not a pretty concept.
Remembering that it isn't even running privileged code, but a normal
application!
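
To make the two-halves problem concrete, here is a minimal sketch in Python. It assumes a simplified model (not any particular CPU) in which a straddling word is stored as two independent half-word writes and loaded as two independent half-word reads. The name `torn_read_outcomes` and the 16-bit halves are illustrative choices. The function enumerates every interleaving of one writer and one reader and collects what the reader can observe.

```python
from itertools import combinations


def torn_read_outcomes(old, new, half_bits=16):
    """Return every value a reader can see while a writer replaces old with new.

    Model (an assumption of this sketch): the writer stores the high half,
    then the low half; the reader loads the high half, then the low half.
    The four steps may interleave in any order that keeps each side's
    own program order.
    """
    mask = (1 << half_bits) - 1
    outcomes = set()
    # Pick which two of the four time slots belong to the writer.
    for writer_slots in combinations(range(4), 2):
        hi, lo = old >> half_bits, old & mask   # memory contents
        read_hi = read_lo = None
        w_step = r_step = 0
        for slot in range(4):
            if slot in writer_slots:
                if w_step == 0:
                    hi = new >> half_bits       # first half of the store
                else:
                    lo = new & mask             # second half of the store
                w_step += 1
            else:
                if r_step == 0:
                    read_hi = hi                # first half of the load
                else:
                    read_lo = lo                # second half of the load
                r_step += 1
        outcomes.add((read_hi << half_bits) | read_lo)
    return outcomes


if __name__ == "__main__":
    for value in sorted(torn_read_outcomes(0x11112222, 0xAAAABBBB)):
        print(hex(value))
```

With old value 0x11112222 and new value 0xAAAABBBB, the set contains the two consistent values plus the mixed values 0x1111BBBB and 0xAAAA2222. In this model, anything that delays one side between its two halves, such as a TLB miss, only widens the window in which the mixed cases can occur.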
Regards,
Nick Maclaren. |
|
|
 |
Emil Naepflein
Guest
|
Posted:
Mon Jan 17, 2005 3:15 pm Post subject:
Re: Unaligned accesses |
|
|
Nick Maclaren wrote:
| Quote: | It gets worse - think TLBs on SMP systems. There have been reports of
some systems where there were not enough TLB entries to handle the
worst possible cases, which therefore could never complete!
|
Then this is bad programming of the TLB handling - if handling is in
software. ;-)
| Quote: | On an SMP system, a communication word (say one that is written from
one thread and read from the others) is unaligned. Now work out
what happens with the TLB misses and how much this will increase
the chances of the two halves of the word being inconsistent when
they are read.
|
It would be braindead of a hardware designer to place a device
register unaligned in the address space. And software is normally
allowed to access device registers only one after another, each with
the correct size. So this case should never arise in practice.
| Quote: | The alternative is, of course, that the system has to support
multi-TLB lockdown, of some sort, which is not a pretty concept.
Remembering that it isn't even running privileged code, but a normal
application!
|
Some processors support hard-wired TLB entries, but these are used by the
OS only.
We certainly agree that support for transparent unaligned access in
cache-coherent SMP systems is a "Pandora's box" which should never be
opened. ;-)
Emil
--
Philosys Software GmbH, Edisonstrasse 6, 85716 Unterschleissheim, Germany
System Software is our Speciality
Phone: +49 89/321407-40 Fax: +49 89/321407-12
EMail: egn@philosys.de WWW: www.philosys.de |
|
|
 |
Emil Naepflein
Guest
|
Posted:
Mon Jan 17, 2005 3:24 pm Post subject:
Re: Unaligned accesses |
|
|
Niels Jørgen Kruse wrote:
| Quote: | David Wang <foo@bar.invalid> wrote:
On the same slides, pp 32~33 seem to suggest that non-alignment causes
some really serious problems for PPC 74xx series of processors, at least
as far as gcc is concerned. It seems to be quite a bit larger than just
one cycle or two. Do you have some thoughts on these data?
It seems to be a memcpy implementation that drops back to byte at a time
on unaligned buffers. Look at the Bytes/Loop row.
|
You may do better if you realign and merge the data in processor
registers. The processing may be hidden in the memory delay.
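
A minimal sketch of what "realign and merge in registers" can look like, written in Python as a model and not as real memcpy code. It assumes 4-byte big-endian words (as on PowerPC) and a machine that only permits aligned loads; `aligned_load` and `copy_from_unaligned` are hypothetical names for this illustration. Each output word is built from two neighbouring aligned words with a shift and an OR, so the loop issues one aligned load per word copied.

```python
WORD = 4                 # bytes per aligned load (assumption of this sketch)
BITS = 8 * WORD
MASK = (1 << BITS) - 1


def aligned_load(mem, addr):
    """Load one big-endian word; the model refuses unaligned addresses."""
    assert addr % WORD == 0, "model only permits aligned loads"
    return int.from_bytes(mem[addr:addr + WORD], "big")


def copy_from_unaligned(mem, src, nwords):
    """Return nwords words starting at src, using only aligned loads.

    Like real word-at-a-time copies, this reads up to WORD - 1 bytes
    before src and after the end of the requested range.
    """
    offset = src % WORD
    base = src - offset
    if offset == 0:
        return bytes(mem[src:src + nwords * WORD])   # already aligned
    shift = 8 * offset
    out = bytearray()
    prev = aligned_load(mem, base)                   # word containing src
    for i in range(1, nwords + 1):
        nxt = aligned_load(mem, base + i * WORD)
        # Tail of the previous word joined with the head of the next one.
        merged = ((prev << shift) | (nxt >> (BITS - shift))) & MASK
        out += merged.to_bytes(WORD, "big")
        prev = nxt                                   # reuse, no second load
    return bytes(out)


if __name__ == "__main__":
    memory = bytes(range(32))
    print(copy_from_unaligned(memory, 1, 2).hex())
```

The shift and OR are plain register operations, which is why their cost may disappear behind the memory latency, as suggested above.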
Emil
--
Philosys Software GmbH, Edisonstrasse 6, 85716 Unterschleissheim, Germany
System Software is our Speciality
Phone: +49 89/321407-40 Fax: +49 89/321407-12
EMail: egn@philosys.de WWW: www.philosys.de |
|
|
 |
Bernd Paysan
Guest
|
Posted:
Mon Jan 17, 2005 3:33 pm Post subject:
Re: RISC vs. CISC design principles |
|
|
Terje Mathisen wrote:
| Quote: | Could it be a problem that reg-reg moves happen so often, and
often/usually overlap in the lifetime of the results?
This could easily reduce the number of 'natural' checkpoints (i.e.
places where every architected register is stored in a separate rename
register) to zero. :-(
|
No, there's no need to "unalias" physical registers. It's perfectly safe to
checkpoint a situation like
ax -> 5
bx -> 5
cx -> 3
dx -> 4
si -> 3
di -> 5
bp -> 6
sp -> 6
The architectural register group ax, bx, and di, as well as the group sp and
bp *do* contain the same value, so having the same address doesn't hurt
(it's a single static assigned address, it will not be overwritten as long
as it is alive).
The problematic point is an instruction followed by a move - you can't use
that as checkpoint. But you *can* use the mov instruction itself as
checkpoint, and tie it to the following instruction when the mov
instruction vanishes. So when that instruction fails, the moves before are
backed out, too. Reverting to single step then can reveal the real failure
source.
Transmeta still generates a non-renamed set of architectural registers at
checkpoints, though everything in between can be as the code morpher
decides. That's necessary because there's no register renaming. It's also
necessary to keep the moves "real" on the Athlon/Opteron architecture,
since there also is a non-renamed architectural register file (future/past
file), though the in-flight instructions do have something that is similar
to register renaming.
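
The aliasing idea can be sketched as a toy rename map in Python. This is an illustration under stated assumptions, not a description of any shipping core: the class `RenameMap` and its methods are hypothetical, physical register numbers are arbitrary, and freeing of physical registers (which a real design would need, for example via reference counts) is left out. A reg-reg move only copies a map entry, and a checkpoint is just a copy of the whole map, aliases included.

```python
class RenameMap:
    """Toy architectural-to-physical register map with move elimination."""

    def __init__(self, arch_regs):
        # Each architectural register starts in its own physical register.
        self.map = {reg: i for i, reg in enumerate(arch_regs)}
        self.next_phys = len(arch_regs)

    def write(self, dst):
        """An instruction producing a new value gets a fresh physical register."""
        self.map[dst] = self.next_phys
        self.next_phys += 1
        return self.map[dst]

    def mov(self, dst, src):
        """Eliminated move: dst now names the same physical register as src."""
        self.map[dst] = self.map[src]

    def checkpoint(self):
        """Snapshot the map; aliased entries are saved as they are."""
        return dict(self.map)

    def restore(self, snapshot):
        """Back out everything renamed since the snapshot was taken."""
        self.map = dict(snapshot)


if __name__ == "__main__":
    rm = RenameMap(["ax", "bx", "cx", "dx", "si", "di", "bp", "sp"])
    rm.mov("bx", "ax")
    rm.mov("di", "ax")
    rm.mov("bp", "sp")
    snap = rm.checkpoint()
    rm.write("ax")          # ax gets a new value; bx and di keep the old one
    rm.restore(snap)        # the failing instruction is backed out
    print(rm.map)
```

Because a physical register is written once and never overwritten while it is live, several architectural names pointing at it in a checkpoint cause no harm, which is the point made above.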
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/ |
|
|
 |
Nick Maclaren
Guest
|
Posted:
Mon Jan 17, 2005 4:03 pm Post subject:
Re: Unaligned accesses |
|
|
In article <8k3nu0prh2sc4p6hfrveau1jq4bnglo0id@4ax.com>, Emil Naepflein <Emil.Naepflein@philosys.de> writes:
|> Nick Maclaren wrote:
|>
|> > It gets worse - think TLBs on SMP systems. There have been reports of
|> > some systems where there were not enough TLB entries to handle the
|> > worst possible cases, which therefore could never complete!
|>
|> Then this is bad programming of the TLB handling - if handling is in
|> software. ;-)
No, it isn't. There should not be a need for TLB handling to have
to emulate instructions, to execute one in real=virtual mode or to
have to fail with 'instruction too complex'. And those are the only
three solutions, once you are in that hole. The solution is not to
go there in the first place.
|> > On an SMP system, a communication word (say one that is written from
|> > one thread and read from the others) is unaligned. Now work out
|> > what happens with the TLB misses and how much this will increase
|> > the chances of the two halves of the word being inconsistent when
|> > they are read.
|>
|> It would be braindead of a hardware designer to place a device
|> register unaligned in the address space. And software is normally
|> allowed to access device registers only one after another, each with
|> the correct size. So this case should never arise in practice.
Please read the above again. I am not referring to device registers.
I am referring to a shared memory location - nothing more.
Regards,
Nick Maclaren. |
|
|
 |
Emil Naepflein
Guest
|
Posted:
Mon Jan 17, 2005 4:24 pm Post subject:
Re: Unaligned accesses |
|
|
Nick Maclaren wrote:
| Quote: | Please read the above again. I am not referring to device registers.
I am referring to a shared memory location - nothing more.
|
Sorry, you are right.
Emil
--
Philosys Software GmbH, Edisonstrasse 6, 85716 Unterschleissheim, Germany
System Software is our Speciality
Phone: +49 89/321407-40 Fax: +49 89/321407-12
EMail: egn@philosys.de WWW: www.philosys.de |
|
|
 |
Terje Mathisen
Guest
|
Posted:
Mon Jan 17, 2005 5:52 pm Post subject:
Re: RISC vs. CISC design principles |
|
|
Bernd Paysan wrote:
| Quote: | Terje Mathisen wrote:
Could it be a problem that reg-reg moves happen so often, and
often/usually overlap in the lifetime of the results?
This could easily reduce the number of 'natural' checkpoints (i.e.
places where every architected register is stored in a separate rename
register) to zero. :-(
No, there's no need to "unalias" physical registers. It's perfectly safe to
checkpoint a situation like
ax -> 5
bx -> 5
cx -> 3
dx -> 4
si -> 3
di -> 5
bp -> 6
sp -> 6
The architectural register group ax, bx, and di, as well as the group sp and
bp *do* contain the same value, so having the same address doesn't hurt
(it's a single static assigned address, it will not be overwritten as long
as it is alive).
The problematic point is an instruction followed by a move - you can't use
that as checkpoint. But you *can* use the mov instruction itself as
checkpoint, and tie it to the following instruction when the mov
instruction vanishes. So when that instruction fails, the moves before are
backed out, too. Reverting to single step then can reveal the real failure
source.
|
In which case my original question returns:
Why aren't Intel/AMD doing this already?
If a hw trap can modify the architectural registers (on the stack)
before returning, this still works because each POP reg instruction will
target a new rename register, right?
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
|
 |
Terje Mathisen
Guest
|
Posted:
Mon Jan 17, 2005 6:29 pm Post subject:
Re: Unaligned accesses |
|
|
Nick Maclaren wrote:
| Quote: | On an SMP system, a communication word (say one that is written from
one thread and read from the others) is unaligned. Now work out
what happens with the TLB misses and how much this will increase
the chances of the two halves of the word being inconsistent when
they are read.
|
IMHO, using a LOCK prefix (implicit or explicit) on a misaligned data
item should trap even on a cpu which supports unaligned accesses.
It should do so even in the cases that would be easy to support: Single
cpu system, with both halves in L1 cache or even within the same cache line.
| Quote: | The alternative is, of course, that the system has to support
multi-TLB lockdown, of some sort, which is not a pretty concept.
Remembering that it isn't even running privileged code, but a normal
application!
|
I know, it is pretty bad. :-(
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
|
 |
Bernd Paysan
Guest
|
Posted:
Mon Jan 17, 2005 8:34 pm Post subject:
Re: RISC vs. CISC design principles |
|
|
Terje Mathisen wrote:
| Quote: | In which case my original question returns:
Why aren't Intel/AMD doing this already?
|
You have to ask Fred Weber and Pat Gelsinger ;-).
For the Willamette/Northwood P4, doing moves doesn't hurt much. The
dependency is just a half-cycle, so a dependent add could be executed in the
second half. For the Prescott P4, the dependency again adds a complete
cycle (no double-pumped ALU). The register renaming is there, and the
scheduler already is too aggressive for direct (or "naive") precise
exceptions, so the checkpoint/step forward mechanism is in place.
The Athlon64 engine doesn't have a double-pumped ALU, so the moves are a
concern. The past/future files are not renamed registers; there's a 1:1
mapping between architectural and physical registers. The combination of
adjacent mov/op operations with the same target register still should be
possible, but since ATM there's a 1:n relation between instruction and
uops, and a less aggressive scheduler, the checkpoint/step forward
mechanism may not be used (it is possible to implement precise exceptions
with the Opteron execution unit without).
| Quote: | If a hw trap can modify the architectural registers (on the stack)
before returning, this still works because each POP reg instruction will
target a new rename register, right?
|
Exactly.
--
Bernd Paysan
"If you want it done right, you have to do it yourself"
http://www.jwdt.com/~paysan/ |
|
|
 |
Guest
|
Posted:
Mon Jan 17, 2005 10:03 pm Post subject:
Re: RISC vs. CISC design principles |
|
|
"Why aren't Intel/AMD doing this already?"
Rumor has it that Dothan does this, at least for integer moves.
Mitch |
|
|
 |
Eric P.
Guest
|
Posted:
Mon Jan 17, 2005 10:10 pm Post subject:
Re: Unaligned accesses |
|
|
Nick Maclaren wrote:
| Quote: |
It gets worse - think TLBs on SMP systems. There have been reports of
some systems where there were not enough TLB entries to handle the
worst possible cases, which therefore could never complete! Whether
or not that is so, think of the following nightmare scenario:
|
I think this is a pretty clear example of a misdesigned TLB.
Obviously the TLB must have enough entries to allow the instruction
to complete. (There is one case where this might happen, see below)
However there are some non optimal cases that could arise.
For example the worst case VAX instruction required 46 TLB entries.
If the TLB did not use a True-LRU replacement tracking, such that each
TLB lookup rotated that entry to the front, but used FIFO instead
then it is possible for a miss to cause a new entry to be loaded
which evicts an entry you also need, leading to another load.
In theory this could happen 46 times in a row.
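
The FIFO effect can be reproduced with a small simulation. This is a sketch under simplifying assumptions: a fully associative TLB, an instruction that restarts from the beginning after every miss, and page numbers chosen only for illustration; `misses_until_complete` is a hypothetical name. Four required pages stand in for the VAX's 46.

```python
def misses_until_complete(initial, needed, policy):
    """Count TLB misses until one instruction touching `needed` completes.

    `initial` lists the TLB contents with the next victim first; its
    length is the TLB capacity. `policy` is "fifo" or "lru".
    """
    assert policy in ("fifo", "lru")
    assert len(initial) >= len(set(needed)), "TLB too small to ever finish"
    tlb = list(initial)                  # index 0 is evicted next
    misses = 0
    while True:
        for page in needed:
            if page in tlb:
                if policy == "lru":
                    tlb.remove(page)     # a hit moves the entry to the
                    tlb.append(page)     # most recently used position
                continue
            tlb.pop(0)                   # miss: evict the victim,
            tlb.append(page)             # load the new translation
            misses += 1
            break                        # and restart the instruction
        else:
            return misses                # every page hit in one pass


if __name__ == "__main__":
    start = [0, 1, 2, 99]                # pages 0-2 resident but old
    need = [0, 1, 2, 3]
    print("fifo:", misses_until_complete(start, need, "fifo"))
    print("lru: ", misses_until_complete(start, need, "lru"))
```

In this setup only page 3 is actually absent, yet FIFO takes 4 misses because each load evicts another page the instruction still needs, while LRU takes 1.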
The only architecture I ever heard of which might have had "no upper
limit on required TLB entries" was the Data General Eclipse.
It inherited a weird address mode from the Nova which could,
in principle, cause this.
The Nova had only 16 bit word access, no byte access.
If the 'chain bit' of the address (lsb or msb, I can't remember
which one) was zero then it was the address of the value. If it
was a one then it was a new address of the value, which was then
inspected for the 'chain' bit. This iterative
chaining would continue until a zero chain bit was encountered.
The Eclipse had virtual memory but was also backward compatible
to the Nova instruction set so might have had a problem.
However I would imagine DG solved the problem by simply
defining an upper limit on the length of an address chain.
So it was probably a non issue for all except the most exotic
users of this feature.
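
For illustration, the chained addressing described above can be modelled in a few lines of Python. The sketch assumes the chain flag is the most significant bit of a 16-bit word (the post itself is unsure which end it was) and adds the kind of hop limit speculated about above; `resolve`, `CHAIN_BIT` and `max_hops` are hypothetical names, not DG terminology.

```python
CHAIN_BIT = 0x8000   # assumption: chain flag in the msb of a 16-bit word
ADDR_MASK = 0x7FFF   # remaining 15 bits hold the address


def resolve(memory, word, max_hops=16):
    """Follow chained addresses until a word with a clear chain bit appears.

    Returns (effective_address, hops). Every hop is one more memory
    reference, and so potentially one more page the TLB must map.
    """
    hops = 0
    while word & CHAIN_BIT:
        if hops == max_hops:
            raise RuntimeError("address chain exceeds the hop limit")
        word = memory[word & ADDR_MASK]   # fetch the next link
        hops += 1
    return word & ADDR_MASK, hops


if __name__ == "__main__":
    memory = {5: CHAIN_BIT | 7, 7: CHAIN_BIT | 9, 9: 0x0003}
    print(resolve(memory, CHAIN_BIT | 5))
```

Without the limit, a word that points at itself with the chain bit set would never resolve, which is the unbounded case in question.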
| Quote: | On an SMP system, a communication word (say one that is written from
one thread and read from the others) is unaligned. Now work out
what happens with the TLB misses and how much this will increase
the chances of the two halves of the word being inconsistent when
they are read.
|
The easy answer is don't define straddles as "guaranteed consistent".
But the TLB should probe all page addresses involved before starting
a write operation so you can't get a page fault in the middle of
updating a straddle.
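
A minimal sketch of "probe first, then write", in Python. It assumes 4 KiB pages and represents the set of currently mapped pages as a plain set; `pages_touched` and `store` are hypothetical names for this illustration. The store faults before touching memory if any page it would cover is unmapped, so a fault can never leave a straddling value half updated.

```python
def pages_touched(addr, size, page_size=4096):
    """Return the page numbers covered by an access of `size` bytes at `addr`."""
    first = addr // page_size
    last = (addr + size - 1) // page_size
    return list(range(first, last + 1))


def store(memory, mapped_pages, addr, data, page_size=4096):
    """Write `data` at `addr` only if every page involved is mapped."""
    missing = [page for page in pages_touched(addr, len(data), page_size)
               if page not in mapped_pages]
    if missing:
        # Fault up front: nothing has been modified yet.
        raise LookupError(f"page fault on pages {missing}")
    memory[addr:addr + len(data)] = data


if __name__ == "__main__":
    mem = bytearray(8192)
    print(pages_touched(4094, 4))            # straddles pages 0 and 1
    store(mem, {0, 1}, 4094, b"\x01\x02\x03\x04")
    print(mem[4094:4098].hex())
```

This only removes the fault in the middle of a write. It does not make the straddling store atomic with respect to other processors.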
| Quote: | The alternative is, of course, that the system has to support
multi-TLB lockdown, of some sort, which is not a pretty concept.
Remembering that it isn't even running privileged code, but a normal
application!
|
Or use the True-LRU entry tracking in the MMU I mentioned above.
Eric |
|
|
 |
Guest
|
Posted:
Tue Jan 18, 2005 1:44 am Post subject:
Re: RISC vs. CISC design principles |
|
|
Terje Mathisen <terje.mathisen@hda.hydro.com> writes:
| Quote: | What is wrong with using the register renaming unit to remove
reg-reg MOVes completely, by treating them like read-only shared
pages in an operating system?
Since the usual approach is to turn all instructions into
three-operand micro-ops that write to a new unique location anyway,
it would seem like such MOV opcodes really don't do anything at
all, except updating a scoreboard to remember which architected
register is stored in which physical reg.
I can see how this could create a few more problems in case of a
trap/interrupt, but otherwise it seems like such an obvious idea
that there must be something wrong with it?
|
I asked this about eliminating register spills onto the stack some time
ago, and as I recall there was a detailed answer from someone about how it
was done etc.
--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be. |
|
|
 |
Guest
|
Posted:
Tue Jan 18, 2005 1:47 am Post subject:
Re: Unaligned accesses |
|
|
jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes:
| Quote: | Putting only a few extra gates on a chip to allow unaligned
accesses, and then warning programmers that these accesses will have
a performance penalty, so they should not be used unless really
needed, is usually the best tradeoff, though. It eliminates a
potential source of confusion and error at the lowest cost.
|
Because you are paying the gate delay penalty for EVERY access
that now has to go through them.
See the Alpha papers for the messy details, and more.
--
Paul Repacholi 1 Crescent Rd.,
+61 (08) 9257-1001 Kalamunda.
West Australia 6076
comp.os.vms,- The Older, Grumpier Slashdot
Raw, Cooked or Well-done, it's all half baked.
EPIC, The Architecture of the future, always has been, always will be. |
|
|
 |
Terje Mathisen
Guest
|
Posted:
Tue Jan 18, 2005 1:48 am Post subject:
Re: RISC vs. CISC design principles |
|
|
MitchAlsup@aol.com wrote:
| Quote: | "Why aren't Intel/AMD doing this already?"
Rumor has it that Dothan does this, at least for integer moves.
|
<BG>
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
|
 |