| Author |
Message |
Terje Mathisen
Guest
|
Posted:
Tue Jan 18, 2005 7:58 am Post subject:
Re: Unaligned accesses |
|
|
prep@prep.synonet.com wrote:
| Quote: | jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes:
Putting only a few extra gates on a chip to allow unaligned
accesses, and then warning programmers that these accesses will have
a performance penalty, so they should not be used unless really
needed, is usually the best tradeoff, though. It eliminates a
potential source of confusion and error at the lowest cost.
Because you are paying the gate delay penalty for EVERY access
that now has to go through them.
|
Is that really true?
_Either_ you pay the gate delay penalty of being able to detect
misaligned accesses, and convert those to a trap,
_or_ you pay the gate delay penalty of being able to detect misaligned
accesses, and convert those into slower/microcoded sequences.
:-)
I'll accept that generating a trap is probably easier, since you need
that for other problem cases (i.e. out-of-bounds) anyway, but the HW
that allows the cpu to do a realtime decision of the path to follow
should be very similar.
It is only if/when the trap is async that this really becomes worrisome,
since at this point the cpu must revert to the last checkpoint and
single-step forward to the point of the trap.
If the same mechanism is used to handle misaligned accesses, then they
will be so slow as to make the alternate (aligned only) code sequence
faster except when misalignment is very rare.
OK, I guess I'm sorta/reluctantly agreeing with you. :-(
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
|
 |
Seongbae Park
Guest
|
Posted:
Tue Jan 18, 2005 4:58 pm Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
John Savard <jsavard@excxn.aNOSPAMb.cdn.invalid> wrote:
| Quote: | On 15 Jan 2005 22:05:14 -0800, "MrTibbs" <jim2101024@comcast.net> wrote,
in part:
Putting only a few extra gates on a chip to allow unaligned accesses,
I'm not so sure it's just a few gates. What if the unaligned access
crosses a cache line boundary, and one line is in the cache and one
isn't? What if it crosses a page boundary, and blah blah...
You make it into a few gates by turning an unaligned access into
multiple accesses of smaller things.
|
You can't simply turn it into multiple smaller accesses
without locking multiple cache lines (or potentially even TLB entries
if it crosses page boundary)
if the ISA defines the memory operations to be atomic (most ISAs do).
Locking multiple anything will cost more than "just a few gates"
if otherwise you don't need to do so.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/" |
|
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Jan 19, 2005 1:07 am Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
Seongbae Park wrote:
| Quote: | John Savard <jsavard@excxn.aNOSPAMb.cdn.invalid> wrote:
You make it into a few gates by turning an unaligned access into
multiple accesses of smaller things.
You can't simply turn it into multiple smaller accesses
without locking multiple cache lines (or potentially even TLB entries
if it crosses page boundary)
if the ISA defines the memory operations to be atomic (most ISAs do).
Locking multiple anything will cost more than "just a few gates"
if otherwise you don't need to do so.
|
In the cases we've been discussing allowing mis-aligned accesses to be
not atomic wouldn't cost anything at all:
After all, this is what the alternative sequence has to do anyway, right?
I.e. I'd be perfectly happy with a "best effort" alignment handler in hw:
Load a single item (quickly) if aligned, otherwise load two items into
the barrel shifter, shift to align, and return the result.
This would be at least comparable to an explicit sw sequence to do the
same task, and it would simplify programming quite a bit.
(I.e. aligned writes and misaligned reads are nearly the same speed as
having both aligned on most x86 implementations!)
Using a LOCK prefix should trap in such a case.
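A minimal sketch of the "load two items, shift to align" idea, written as the explicit software sequence it would replace. The function name load32_via_aligned is made up for illustration, and the code assumes a little-endian byte order and that the word after the one containing the first byte is readable. It is not atomic, which matches the "best effort" handling described above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: fetch the 32-bit value that starts at byte offset
 * `off` in a word array, using only aligned 32-bit loads.
 * Assumes little-endian byte numbering within each word and that
 * base[off / 4 + 1] is readable whenever `off` is not a multiple of 4. */
uint32_t load32_via_aligned(const uint32_t *base, size_t off)
{
    size_t   word  = off / 4;                 /* aligned word holding the first byte */
    unsigned shift = (unsigned)(off % 4) * 8; /* bit distance from alignment */

    assert(base != NULL);

    uint32_t lo = base[word];
    if (shift == 0)
        return lo;                            /* aligned: one load, no shifting */

    uint32_t hi = base[word + 1];             /* misaligned: second aligned load */
    return (lo >> shift) | (hi << (32 - shift)); /* funnel-shift the pair */
}
```

With the words 0x44332211 and 0x88776655 in memory, offset 1 yields 0x55443322: the upper three bytes of the first word joined with the lowest byte of the second.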
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
|
 |
Eric P.
Guest
|
Posted:
Wed Jan 19, 2005 2:01 am Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
Terje Mathisen wrote:
| Quote: |
Seongbae Park wrote:
John Savard <jsavard@excxn.aNOSPAMb.cdn.invalid> wrote:
You make it into a few gates by turning an unaligned access into
multiple accesses of smaller things.
You can't simply turn it into multiple smaller accesses
without locking multiple cache lines (or potentially even TLB entries
if it crosses page boundary)
if the ISA defines the memory operations to be atomic (most ISAs do).
Locking multiple anything will cost more than "just a few gates"
if otherwise you don't need to do so.
In the cases we've been discussing allowing mis-aligned accesses to be
not atomic wouldn't cost anything at all:
|
Note that the Intel x86 does NOT guarantee atomic access to
nonaligned values that straddle 32 byte cache lines.
(Vol 3, Sys Prog Guide, section 7.1.1)
| Quote: | After all, this is what the alternative sequence has to do anyway, right?
I.e. I'd be perfectly happy with a "best effort" alignment handler in hw:
Load a single item (quickly) if aligned, otherwise load two items into
the barrel shifter, shift to align, and return the result.
|
Most of this hw support would likely already be present in the L1 data
cache as it is required for byte and aligned word/dword/qword access.
Nonaligned access should require only minor extensions.
| Quote: | This would be at least comparable to an explicit sw sequence to do the
same task, and it would simplify programming quite a bit.
|
The sw trap incurs a pipeline flush that a hw sequencer does not.
| Quote: | (I.e. aligned writes and misaligned reads are nearly the same speed as
having both aligned on most x86 implementations!)
Using a LOCK prefix should trap in such a case.
|
Hmmm... what else might be affected?
- Load-Store queue must do more complex overlap checks before
allowing read or write reordering
- On store operations that straddle pages, MMU must probe TLB for
both pages before starting so they do not fault half way through.
If both are valid then emit physical addresses to L1.
- Write combine buffer must do more complex check for straddles.
Also must try not to evict one needed part when loading another.
Anything else?
Eric |
|
|
 |
Guest
|
Posted:
Wed Jan 19, 2005 3:26 am Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
You forgot that when you use segmentation, you can cause both cache
line misalignment and address (sub)space wrap around, causing you to
have to lock down 3 cache lines to do the misaligned stuff. Never
happens in practice, but you have to get the hardware right.
One difficulty in the faster pipelines is that one discovers the access
is misaligned a large number of cycles into the operation, causing an
abort and rerun (and a latency penalty). PPro and K5, K6 and Athlon
derivatives simply change the memory pipe from n cycles to (n+1) cycles
in mid-stream and run the pair of operations back to back through the
lowest level of the cache hierarchy. So, while the PPro and Athlon
derivatives are insulated from much of this added rerun latency,
faster pipelines such as the P4 are not.
The LS queue ends up having to split the queue entry into one element for
each misaligned chunk, losing the one-to-one correspondence. This
causes naming headaches in the decoder/issue logic.
The fault handling logic must be prepared to handle multiple
successive exceptions on a single memory reference. Consider a
misaligned store. A1 points to the last line in a page which does not
have write permission (copy on write), A2 points to the next line in a
page which is writable, but uncacheable, A3 (address wrap) crosses over
into a page that is write combining. Yech, not undoable, not
particularly hard (for x86 designers) but Yech anyway. It is this set
of reasons that ultimately yields the conclusion that atomicity cannot
be guaranteed unless the unit addressed is within a single cache line.
(single cache lines are now 32B, 64B and 128B::also Yech, but this is
what you get when every company has to protect every scrap of IP from
every other company).
Mitch |
|
|
 |
Wilco Dijkstra
Guest
|
Posted:
Wed Jan 19, 2005 3:41 am Post subject:
Re: Unaligned accesses |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:csieii$uf6$1@osl016lin.hda.hydro.com...
| Quote: | prep@prep.synonet.com wrote:
jsavard@excxn.aNOSPAMb.cdn.invalid (John Savard) writes:
Putting only a few extra gates on a chip to allow unaligned
accesses, and then warning programmers that these accesses will have
a performance penalty, so they should not be used unless really
needed, is usually the best tradeoff, though. It eliminates a
potential source of confusion and error at the lowest cost.
Because you are paying the gate delay penalty for EVERY access
that now has to go through them.
Is that really true?
|
No, it's not. Those extra gates are already needed to select words,
halfwords and bytes, endian swap them (perhaps dynamically) and
zero or signextend them as necessary. If you look at it you're close
to a full crossbar switch already, so it isn't much more work to
support unaligned accesses. Initial Alphas didn't support any of
this as they didn't have those gates indeed, but they did pay for this
as code using chars and shorts ran slow.
ARM is perhaps the only RISC that added support for unaligned
access due to customer demand. It speeds up code that occasionally
does do unaligned accesses as the cost on ARMs is high (>10x
slower than an aligned access as ARM has no funnel shifter).
It's essential for SIMD as unaligned accesses often outnumber
aligned ones (e.g. SAD in motion estimation).
As Stephen Fuld guessed the hardware people didn't like it initially
but then again ARM already has instructions that can straddle up to 4
cache lines...
| Quote: | _Either_ you pay the gate delay penalty of being able to detect
misaligned accesses, and convert those to a trap,
_or_ you pay the gate delay penalty of being able to detect misaligned
accesses, and convert those into slower/microcoded sequences.
:-)
I'll accept that generating a trap is probably easier, since you need
that for other problem cases (i.e. out-of-bounds) anyway, but the HW
that allows the cpu to do a realtime decision of the path to follow
should be very similar.
|
Indeed you have a lot more time for a trap as you only have to generate
it just before the cache returns the hit signal. However generating an
unaligned signal is so easy it can be done during effective address
generation at virtually no cost. This can then be used to stall the load
store unit for an extra cycle to access the other cacheline (the ARM11
does this). If the execution units are statically scheduled you'll have
to replay the load, but since cachelines are large nowadays this doesn't
matter much (see below).
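To illustrate why the unaligned signal is cheap, here is a small sketch of the two checks expressed in C. Both only look at the low address bits, which is why they can sit next to effective address generation. The function names and the power-of-two restriction on size and line are assumptions made for this example, not a description of any particular core.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when `addr` is not a multiple of the access size.
 * `size` must be a power of two (1, 2, 4, 8, ...). */
bool is_misaligned(uint64_t addr, uint32_t size)
{
    assert(size != 0 && (size & (size - 1)) == 0);
    return (addr & (size - 1)) != 0;
}

/* True when the access [addr, addr + size) spills into the next cache line.
 * `line` must be a power of two and at least as large as `size`. */
bool crosses_line(uint64_t addr, uint32_t size, uint32_t line)
{
    assert(line != 0 && (line & (line - 1)) == 0 && size <= line);
    return (addr & (line - 1)) + size > line;
}
```

Only the second condition needs the extra cycle for the other cache line; a misaligned access that stays inside one line can be served by the byte-select logic alone.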
| Quote: | It is only if/when the trap is async that this really becomes worrisome,
since at this point the cpu must revert to the last checkpoint and
single-step forward to the point of the trap.
If the same mechanism is used to handle misaligned accesses, then they
will be so slow as to make the alternate (aligned only) code sequence
faster except when misalignment is very rare.
|
Assuming a 10-cycle cost for an unaligned word access crossing a
64-byte cacheline it would take 192 cycles for the replay mechanism
to be worse! So in principle it would be possible to add unaligned
access to a CPU that doesn't support it by taking a trap, inserting the
instructions for an unaligned access using a micro code engine and still
get a (small) speedup :-)
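One reading of the arithmetic that reproduces the 192-cycle figure, sketched in C. The assumptions here are mine, not stated in the post: the explicit software sequence costs 10 cycles on every access, a hardware access that stays within a line costs 1 cycle, and word addresses are spread evenly over all byte offsets, so 3 of every 64 offsets cross a 64-byte line. The name breakeven_penalty is made up for the example.

```c
#include <assert.h>

/* Extra cycles a line-crossing access may cost before always running the
 * software sequence becomes the cheaper choice on average.
 * Model: (size - 1) of the `line` possible byte offsets cross the boundary,
 * so  sw_cycles = hw_cycles + penalty * (size - 1) / line  at break-even. */
unsigned breakeven_penalty(unsigned sw_cycles, unsigned hw_cycles,
                           unsigned size, unsigned line)
{
    assert(size > 1 && size <= line && sw_cycles >= hw_cycles);
    return (sw_cycles - hw_cycles) * line / (size - 1);
}
```

Under these assumptions breakeven_penalty(10, 1, 4, 64) gives (10 - 1) * 64 / 3 = 192, and halving the line size to 32 bytes halves the break-even point to 96.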
Wilco |
|
|
 |
Eugene Nalimov
Guest
|
Posted:
Wed Jan 19, 2005 7:57 am Post subject:
Re: RISC vs. CISC design principles |
|
|
"Anton Ertl" <anton@mips.complang.tuwien.ac.at> wrote in message
news:2005Jan16.165143@mips.complang.tuwien.ac.at...
| Quote: | ...
As for compilers, they often have options for optimizing for specific
microarchitectures, or to generate more general code. The compiler
will optimize for whatever you ask it to optimize for. Once the new
microarchitecture is on the market for some time, compilers will have
special optimizations for it, and if it is the most popular
microarchitecture, users will usually optimize for that, certainly not
for as many as possible (e.g., people don't optimize PC software for
the 386, 486 or P5 microarchitectures any more).
|
Compiler can do that, that is not a problem. Real problem is that
users do not update their makefiles.
Thanks,
Eugene |
|
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Jan 19, 2005 12:35 pm Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
Eric P. wrote:
| Quote: | Terje Mathisen wrote:
Hmmm... what else might be affected?
- Load-Store queue must do more complex overlap checks before
allowing read or write reordering
|
Not too much though: Currently it must take into consideration both base
and length of each operation; this extension could conservatively
extend this to be the aligned base, and the extended length.
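A rough sketch of that conservative check, assuming the queue compares at some power-of-two granule (8 bytes in the usage note below). The function names and the granule parameter are hypothetical. The point is that widening both operations to aligned boundaries can report a false overlap but never misses a real one, so reordering stays safe.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Exact byte-range overlap of [a, a + alen) and [b, b + blen). */
bool overlaps_exact(uint64_t a, uint32_t alen, uint64_t b, uint32_t blen)
{
    return a < b + blen && b < a + alen;
}

/* Conservative variant: round each base down and each end up to a multiple
 * of `gran` (a power of two), then compare the widened ranges. */
bool may_overlap(uint64_t a, uint32_t alen, uint64_t b, uint32_t blen,
                 uint32_t gran)
{
    assert(gran != 0 && (gran & (gran - 1)) == 0);
    uint64_t mask = ~(uint64_t)(gran - 1);

    uint64_t a0 = a & mask, a1 = (a + alen + gran - 1) & mask;
    uint64_t b0 = b & mask, b1 = (b + blen + gran - 1) & mask;
    return a0 < b1 && b0 < a1;
}
```

With an 8-byte granule, two 2-byte accesses at 0x1000 and 0x1004 do not really overlap but are reported as a possible conflict, which costs a little reordering freedom and nothing in correctness.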
| Quote: |
- On store operations that straddle pages, MMU must probe TLB for
both pages before starting so they do not fault half way through.
If both are valid then emit physical addresses to L1.
- Write combine buffer must do more complex check for straddles.
Also must try not to evict one needed part when loading another.
|
None of these would seem to apply if the store that crosses a cache line
boundary is turned into multiple micro-ops, with traps allowed between
them. I.e. in case of a store that traps halfway, the first half could
get written either once or twice, with no guarantee of what would
actually happen, except that both halves would eventually make it to the
destination.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Jan 19, 2005 12:40 pm Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
MitchAlsup@aol.com wrote:
| Quote: | The fault handling logic must be prepared to handle multiple
successive exceptions on a single memory reference. Consider a
misaligned store. A1 points to the last line in a page which does not
have write permission (copy on write), A2 points to the next line in a
page which is writable, but uncacheable, A3 (address wrap) crosses over
into a page that is write combining. Yech, not undoable, not
particularly hard (for x86 designers) but Yech anyway. It is this set
|
Ouch. Getting all corner cases right tends to be a hard problem indeed. :-(
| Quote: | of reasons that ultimately yields the conclusion that atomicity cannot
be guaranteed unless the unit addressed is within a single cache line.
(single cache lines are now 32B, 64B and 128B::also Yech, but this is
what you get when every company has to protect every scrap of IP from
every other company).
|
The good thing is that the average L1 cache line size seems to increase
at about the same rate as the maximum/average load unit size, which
means that the percentage of misaligned loads that would straddle a
boundary stays about the same or goes down.
I.e. just handling the 'easy' within a single cache line case is enough
to get very worthwhile speedups!
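A quick way to check the claim about the straddle percentage, assuming (my assumption) that misaligned loads are spread evenly over all misaligned byte offsets within a line. The function name is made up for the example. It counts, for a given load size and line size, what fraction of the misaligned offsets also cross the line boundary.

```c
#include <assert.h>

/* Fraction of misaligned start offsets within one cache line whose access
 * of `size` bytes runs past the end of the line.
 * Returns 0.0 when no offset is misaligned (size == 1). */
double straddle_fraction(unsigned size, unsigned line)
{
    unsigned misaligned = 0, crossing = 0;

    assert(size != 0 && size <= line);
    for (unsigned off = 0; off < line; ++off) {
        if (off % size == 0)
            continue;              /* naturally aligned, never crosses */
        ++misaligned;
        if (off + size > line)
            ++crossing;            /* tail falls in the next line */
    }
    return misaligned ? (double)crossing / misaligned : 0.0;
}
```

Under this model the fraction works out to size / line: 4-byte loads with 32-byte lines, 8-byte loads with 64-byte lines and 16-byte loads with 128-byte lines all give 0.125, while 4-byte loads with 64-byte lines give 0.0625.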
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
|
 |
Eric P.
Guest
|
Posted:
Wed Jan 19, 2005 8:54 pm Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
MitchAlsup@aol.com wrote:
| Quote: |
You forgot that when you use segmentation, you can cause both cache
line misalignment and address (sub)space wrap around, causing you to
have to lock down 3 cache lines to do the misaligned stuff. Never
happens in practice, but you have to get the hardware right.
|
I don't follow this. If a segment offset wraps around, it is the
same as linear address wrap around which is the same as any other
page straddle. I see only 2 pages and therefore 2 cache lines
being touched.
| Quote: | One difficulty in the faster pipelines is that one discovers the access
is misaligned a large number of cycles into the operation, causing an
abort and rerun (and a latency penalty). PPro and K5, K6 and Athlon
derivatives simply change the memory pipe from n cycles to (n+1) cycles
in mid-stream and run the pair of operations back to back through the
lowest level of the cache hierarchy. So, while the PPro and Athlon
derivatives are insulated from much of this added rerun latency,
faster pipelines such as the P4 are not.
The LS queue ends up having to split the queue entry into one element for
each misaligned chunk, losing the one-to-one correspondence. This
causes naming headaches in the decoder/issue logic.
|
The diagrams for Intel Pentium 4 (and in H&P) don't go into
much detail on exactly how the various modules are organized.
For example, the Pentium 4 Microarchitecture diagram just
shows a large blob called "L1 Data Cache", which actually comprises
Load-Store Queue (LSQ), Data TLB (DTLB), Write Combine Buffer (WCB),
Store-Load Forwarder (SLF) and the L1 cache itself.
One possible organization is:
     DTLB
      ^
      |
LSQ->MMU--->SLF->L1
      |
      v
     WCB
I had imagined, possibly erroneously, that the LSQ was
located prior to the DTLB and contained virtual addresses.
Read bypass checks are therefore done with virtual addresses.
The LSQ feeds the MMU which checks the DTLB and WCB at the same time.
If the address misses the WCB, the MMU checks one or two pages and
then emits one or two physical addresses with cache control flags
to the SLF and L1 (or possibly skips the cache if flags indicate).
In a sense, each LSQ element acts as a reservation station for
sequencing a load or store operation on the MMU function unit.
But I can also see other organizations, such as putting the LSQ
after the MMU and doing read bypass checks on physical addresses.
The LSQ could combine with the Store-Load Forwarder.
However that leads to the added problem you noted, whereby a
single instruction maps to one or two LSQ entries.
| Quote: | The fault handling logic must be prepared to handle multiple
successive exceptions on a single memory reference. Consider a
misaligned store. A1 points to the last line in a page which does not
have write permission (copy on write), A2 points to the next line in a
page which is writable, but uncacheable, A3 (address wrap) crosses over
into a page that is write combining. Yech, not undoable, not
particularly hard (for x86 designers) but Yech anyway. It is this set
of reasons that ultimately yields the conclusion that atomicity cannot
be guaranteed unless the unit addressed is within a single cache line.
(single cache lines are now 32B, 64B and 128B::also Yech, but this is
what you get when every company has to protect every scrap of IP from
every other company).
Mitch
|
Yes (pending further explanation of your 3 cache line stuff)
Eric |
|
|
 |
Anton Ertl
Guest
|
Posted:
Wed Jan 19, 2005 9:29 pm Post subject:
Re: RISC vs. CISC design principles |
|
|
"Eugene Nalimov" <eugenen@microsoft.com> writes:
| Quote: | "Anton Ertl" <anton@mips.complang.tuwien.ac.at> wrote in message
news:2005Jan16.165143@mips.complang.tuwien.ac.at...
...
As for compilers, they often have options for optimizing for specific
microarchitectures, or to generate more general code. The compiler
will optimize for whatever you ask it to optimize for. Once the new
microarchitecture is on the market for some time, compilers will have
special optimizations for it, and if it is the most popular
microarchitecture, users will usually optimize for that, certainly not
for as many as possible (e.g., people don't optimize PC software for
the 386, 486 or P5 microarchitectures any more).
Compiler can do that, that is not a problem. Real problem is that
users do not update their makefiles.
|
Well, I guess that those who care for performance update their
Makefiles. Of course, many programs for ancient CPUs run fast
enough on current ones even with suboptimal options.
- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html |
|
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Jan 19, 2005 9:39 pm Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
Eric P. wrote:
| Quote: | MitchAlsup@aol.com wrote:
You forgot that when you use segmentation, you can cause both cache
line misalignment and address (sub)space wrap around, causing you to
have to lock down 3 cache lines to do the misaligned stuff. Never
happens in practice, but you have to get the hardware right.
I don't follow this. If a segment offset wraps around, it is the
same as linear address wrap around which is the same as any other
page straddle. I see only 2 pages and therefore 2 cache lines
being touched.
|
I don't follow this either, except possibly in the case of 64K
wraparound for 16-bit code:
32-bit code cannot access more than 4 GB, which also happens to be the
page limit, and if you set the segment limit at less than 4 GB, then
you'll generate a trap when trying to access past the end, right?
I.e. there has never been any architectural definition for segment
wraparound at some randomly aligned segment end.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
|
 |
Greg Lindahl
Guest
|
Posted:
Wed Jan 19, 2005 11:57 pm Post subject:
Re: Unaligned accesses (was Re: RISC vs. CISC design princip |
|
|
In article <csldqe$nra$3@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
| Quote: | The good thing is that the average L1 cache line size seems to increase
at about the same rate as the maximum/average load unit size, which
means that the percentage of misaligned loads that would straddle a
boundary stays about the same or goes down.
|
Is this a statement about your personal codes, or do you really think
this generalization applies to everyone? Lots of codes can't use SIMD
instructions, so their load unit size hasn't changed at all.
-- greg |
|
|
 |
Eugene Nalimov
Guest
|
Posted:
Thu Jan 20, 2005 12:24 am Post subject:
Re: RISC vs. CISC design principles |
|
|
"Anton Ertl" <anton@mips.complang.tuwien.ac.at> wrote in message
news:2005Jan19.172902@mips.complang.tuwien.ac.at...
| Quote: | "Eugene Nalimov" <eugenen@microsoft.com> writes:
"Anton Ertl" <anton@mips.complang.tuwien.ac.at> wrote in message
news:2005Jan16.165143@mips.complang.tuwien.ac.at...
...
As for compilers, they often have options for optimizing for specific
microarchitectures, or to generate more general code. The compiler
will optimize for whatever you ask it to optimize for. Once the new
microarchitecture is on the market for some time, compilers will have
special optimizations for it, and if it is the most popular
microarchitecture, users will usually optimize for that, certainly not
for as many as possible (e.g., people don't optimize PC software for
the 386, 486 or P5 microarchitectures any more).
Compiler can do that, that is not a problem. Real problem is that
users do not update their makefiles.
Well, I guess that those who care for performance update their
Makefiles.
|
They should, but the majority of them never do that. I am talking not
about some tiny program, but about large and complex projects
evolving for a decade or more. Have you recently tried to modify
a "creatively written" 100k+ line makefile?
That's why in the upcoming VC8 we are trying to generate
code that will run equally well (or equally poorly) on all x86 CPUs.
It looks like our customers would rather slow down by 5% on all
CPUs than slow down by 30% on some of them.
Thanks,
Eugene
|
|
|
 |
Guest
|
Posted:
Thu Jan 20, 2005 1:04 am Post subject:
Re: RISC vs. CISC design principles |
|
|
What is your definition for "all x86 CPUs"?
I would imagine it doesn't include outsiders like SiS, Transmeta and
Geode GX.
How about VIA?
Does the definition include P6 that currently has near-zero market
share but still dominates installed base? |
|