| Author |
Message |
Alexander Terekhov
Guest
|
Posted:
Thu Sep 01, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
Err..
Alexander Terekhov wrote:
| Quote: |
Ricardo Bugalho wrote:
On Wed, 31 Aug 2005 21:57:58 +0000, Seongbae Park wrote:
I didn't bother to look at IA64 manual - anybody care to comment on this ?
but I suspect that IA64 is RCpc and the manual is exactly correct after
all.
It's RCpc indeed.
Not quite. Release stores to *WB* memory are constrained to ensure
"remote write atomicity". Classic RCpc is weaker in this respect
(and that's what makes RC != TSO). You better not rely on this
                       ^
                       |
PC, not RC. -----------+
| Quote: | property because emulating it on CELLs (for example) will make your
ports run really slow. ;-)
|
regards,
alexander. |
|
Joe Seigh
Guest
|
Posted:
Thu Sep 01, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
Ricardo Bugalho wrote:
| Quote: | On Wed, 31 Aug 2005 21:57:58 +0000, Seongbae Park wrote:
I didn't bother to look at IA64 manual - anybody care to comment on this ?
but I suspect that IA64 is RCpc and the manual is exactly correct after
all.
It's RCpc indeed.
|
So what does "manual is exactly correct" in this case mean? Are
IA-32 loads equivalent to IA64 ld.acq and they are not equivalent
to IA64 ld? I.e. the latter can't emulate an IA-32 load in all cases.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software. |
|
Alexander Terekhov
Guest
|
Posted:
Thu Sep 01, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
Joe Seigh wrote:
[...]
| Quote: | Are IA-32 loads equivalent to IA64 ld.acq and they are not equivalent
to IA64 ld?
|
The ordering constraints are equivalent for IA32 loads and IA64 acquire
loads. But IA64 release stores to WB memory are more constrained than PC
stores, and IA32-under-IA64 effectively runs in TSO for WB memory, not
PC.
regards,
alexander. |
|
Eric P.
Guest
|
Posted:
Thu Sep 01, 2005 9:46 pm Post subject:
Re: Intel x86 memory model question |
|
|
Ricardo Bugalho wrote:
| Quote: |
On Wed, 31 Aug 2005 18:02:34 -0400, Eric P. wrote:
I think the underlying question you asked about the x86 is:
Does the Intel Processor Consistency model require processors to wait
for all other processors to acknowledge receipt of their invalidates
before any are allowed to use the new value?
It does not.
The most straightforward example is buffered store forwarding: when a CPU
writes a value into memory, it can read it again directly from the store
buffer, even before it tries to make it visible to other processors.
|
I meant with regard to other processors not to itself.
Within a processor, yes, the docs explicitly state that
data from buffered writes can be forwarded to waiting reads.
As I understand it, while such local forwarding can have consequences
for consistency models, presumably because it allows subsequent
instructions to complete earlier than they otherwise would have,
it should not have an effect on remote data update ordering.
In short, store to load forwarding, in and of itself, would not
allow a new value of Y to arrive at P3 before the new value of X.
For this to occur seems to me to require both of:
(a) the cache protocol to distribute updates in a non atomic manner by
allowing a new value to be available before all acks are received.
(b) the bus topology and protocol to somehow allow a message to get
from P1 to P2 then P2 to P3 passing the one from P1 to P3,
possibly due to an error and retransmit.
Eric |
|
Andy Glew
Guest
|
Posted:
Fri Sep 02, 2005 11:51 pm Post subject:
Re: Intel x86 memory model question |
|
|
Bottom quoting: asbestos donned!
I think that Joe Seigh has incorrectly assumed that processor
consistency implies (a) a global ordering of all loads, and (b) causal
ordering.
This is not true. At least, I am fairly certain that there is a
causal ordering memory model that is intermediate in semantics between
processor consistency and sequential consistency. (Google finds lots of
papers; I specifically recall Mosberger's survey.) And I do not
believe that I have ever seen a proof that processor consistency
implies a global ordering of all loads; I don't think such a proof
exists; I would be interested to see it if it does; and I strongly
suspect that there is a proof that orderings consistent with processor
consistency may violate causal ordering. Indeed, Joe may have
provided one.
(I do confess that I have occasionally wanted to move from processor
consistency to causal consistency, mainly because causal consistency
sounds like it should be easier to make proofs for; but I am not sure
if causal consistency is any easier to implement than sequential
consistency. Since sequential consistency is easy enough to
implement, I suspect that if we tighten up the memory model we will go
all the way.)
Nearly all statements in processor consistency are local.
For processors Pi, i = ...
Each Pi has a set of instructions Pi.Ij, some of which are loads, some
of which are stores. Notationally Pi.Lj and Pi.Sj, where the index
sets for Lj and Sj are not necessarily contiguous.
Each Pi also sees external stores in some order Pi.Xk.
The sequence of external stores seen by Pi, Pi.Xk, can be formed out
of an interleaving of the set of stores from all other processors Pm.Sj,
m!=i. The only real constraint is that in this interleaving all of
the stores from a particular processor Pm.Sj appear in the order in
which they occurred on that processor; stores from a given processor
are not reordered in the sequence.
The sequence of external stores Pi.Xk is not necessarily equal to
Pj.Xk, for different processors i and j. I.e. although stores from
any single processor are performed in order at any other processor,
other processors do not necessarily see stores from different
processors interleaved in the same order. I.e. there is no single
global store order.
Instruction execution at a single Pi proceeds as if one instruction at
a time were executed, with some interleaving of the external stores
Pi.Xk. I.e. from the point of view of the local processor, its loads
Pi.Lj are performed in order, and in order with the local stores
Pi.Sj. More specifically, there can be constructed an ordering Pi.Mx
which is an interleaving of Pi.Ij (and hence Pi.Lj and Pi.Sj) and
Pi.Xk, and local processor execution is consistent with such an
ordering Pi.Mx.
Note: we say "there can be constructed an ordering". But, so far as I
know, there is no easy way to construct such an ordering for a
particular processor. We know that one could be constructed, but we
don't know what it is. And there is certainly no easy way to construct
it in an online manner.
And, again: there need not be a global ordering of stores from all
processors. And nor need there be a global ordering of loads.
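To make the "interleaving" constraint above concrete, here is a minimal Python sketch. It is an illustration only, not a formal model: the function name `interleavings` and the store labels are hypothetical, and it covers nothing beyond the single rule that stores from one processor keep their program order in every observer's view.

```python
def interleavings(sequences):
    """Yield every merge of `sequences` that keeps each one's internal order."""
    if not any(sequences):
        yield ()
        return
    for idx, seq in enumerate(sequences):
        if seq:
            # Take the next store from this source, then merge what is left.
            rest = sequences[:idx] + (seq[1:],) + sequences[idx + 1:]
            for tail in interleavings(rest):
                yield (seq[0],) + tail


# Hypothetical store streams: P1 writes A then B, P2 writes C.
p1_stores = ("P1:A=1", "P1:B=1")
p2_stores = ("P2:C=1",)

# Every order in which some other processor could observe these external stores.
views = list(interleavings((p1_stores, p2_stores)))
for view in views:
    print(" -> ".join(view))
```

Each observing processor may end up with any one of the listed views, and two observers need not pick the same one. That is all "no single global store order" means in this sketch.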
A formal model must make a few more statements about the limited forms
of causality that are maintained in a processor consistent system.
(E.g. two party causality; three party causality is not maintained, to
the best of my knowledge.) And, to be perfectly honest, I forget what
statements need to be made to differentiate between the two sub-types
of processor consistency: Gharachorloo type I and type II, where in
the latter you can forward from a store buffer (an implementation
consideration).
---
As Mitch says, the above can be briefly stated: WB memory is processor
consistent, type II. Describing the interaction of other memory types
is more complicated.
---
I do not know or care very much what the Itanium processor manual says
about x86 memory ordering. I wouldn't be surprised if they got it
wrong; or, as in the examples Joe provided, described a mapping which
has explanatory value, but not definitional value.
---
Joe Seigh <jseigh_01@xemaps.com> writes:
| Quote: | MitchAlsup@aol.com wrote:
I didn't find it in the Intel book I have (Pentium Pro)
But chapter 7 in Volume 2 of AMD x86-64 Architecture Programmer's
Manual (System Programming) describes AMD's side of the situation,
starting on page 191 of the Purple Volume.
The problem is when you consider the number of memory modes {UC, CD,
WC, WP, WT and WB} that no simplistic statement can fully address what
the programmer can assume about memory and its ordering properties.
WriteBack (cacheable) memory is, however, Processor Consistent.
The argument being presented in c.p.t. is that processor consistency
implies loads are in order, perhaps instigated by something Andy Glew
said about this here
http://groups.google.com/group/comp.arch/msg/96ec4a9fb75389a2
AFAICT, this is not true for 3 or more processors. E.g.
processor 1 stores into X
processor 2 sees the store by 1 into X and stores into Y
So the store into Y occurred after causal reasoning.
processor 3 loads from Y
processor 3 loads from X
If loads were in order you could infer that if processor 3
sees the new value of Y then it will see the new value of X.
But the rules for processor consistency *clearly* state that
you will not necessarily see stores by different processors in
order.
While there are still ordering constraints on the loads they
don't have to be strictly in order as Andy incorrectly infers. |
|
|
Joe Seigh
Guest
|
Posted:
Sat Sep 03, 2005 12:15 am Post subject:
Re: Intel x86 memory model question |
|
|
Andy Glew wrote:
| Quote: |
Bottom quoting: asbestos donned!
I think that Joe Seigh has incorrectly assumed that processor
consistency implies (a) a global ordering of all loads, and (b) causal
ordering.
|
I think I was trying to prove that you couldn't imply global ordering
of loads.
Part of the problem is that there are two target groups of programmers for
the memory model here. Processor consistency is alright if you're
doing HPC/parallel programming but isn't very useful if you're doing
general multi-threaded programming. There, all you really care about
is what the implicit global ordering is between the various combinations
of loads and stores, and what memory barriers to use for the combinations
where ordering isn't defined.
In the ia32 docs, it's a little muddied because of the mention of
speculative loads. None the less I had assumed that loads weren't
ordered and that LFENCE or some other memory barrier or serializing
instruction was needed for global ordering of loads. However there
were some that claimed LFENCE wasn't needed. And the documentation
wasn't explicit enough to definitively counter their claims. And
it had to be really explicit given the rather incomprehensible
arguments they were presenting.
I've basically decided to ignore these people for now and stick with
my original interpretation of the ia32 memory model.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software. |
|
Alexander Terekhov
Guest
|
Posted:
Sat Sep 03, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
Joe Seigh wrote:
[...]
| Quote: | We're assuming weakly ordered memory I think, whatever the typical multiprocessor
Intel box meant to run Linux or windows uses. Whatever "write-back cacheable"
is.
|
It means PC (apart from the non-temporal weakly ordered stuff) under x86
native (not Itanicized x86, i.e. TSO for WB instead of PC), and you don't
need LFENCE under PC.
regards,
alexander. |
|
Joe Seigh
Guest
|
Posted:
Sat Sep 03, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
Alexander Terekhov wrote:
| Quote: | Joe Seigh wrote:
[...]
Whatever. I'm going to use LFENCE for situations where I'd use
#LoadLoad on sparc (generic, not assuming TSO).
You mean RMO? Reportedly, RMO is vaporware, so yeah, you'll get the
same "useful" effect on Sparc as on ia32 (weakly ordered WC memory
aside for a moment): none whatsoever.
|
In the same sense that Sparc documentation assumes the weakest possible
architected memory model when documenting usage of its memory barriers.
I know that some sparc processors only implement TSO and Solaris assumes
and requires TSO (so far).
It's possible Intel processors are all effectively implemented as TSO, but we're
talking about the architected memory model and have to assume that unless
writing model dependent code.
I like how you sidestepped whether LFENCE or some serializing instruction
is required in some situations between successive loads on Intel ia32 processors.
We're assuming weakly ordered memory I think, whatever the typical multiprocessor
Intel box meant to run Linux or windows uses. Whatever "write-back cacheable"
is.
:
This whole thing is bizarre. Any other architecture, e.g. IBM Z architecture,
powerpc, sparc, alpha, ... and there's no problem in discussing whether
memory barriers are needed in certain situations. Only in Intel ia32 and only
when Alexander participates. However, if you filter out any comments by
Alexander then the problem goes away. I should have put in an Alexander filter
earlier. Then I wouldn't have raised this issue in the first place, which
has probably put *me* in a few filters. :)
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software. |
|
Joe Seigh
Guest
|
Posted:
Sat Sep 03, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
Alexander Terekhov wrote:
| Quote: | Joe Seigh wrote:
[...]
In the ia32 docs, it's a little muddied because of the mention of
speculative loads. None the less I had assumed that loads weren't
ordered and that LFENCE or some other memory barrier or serializing
instruction was needed for global ordering of loads.
Neither will give you "global ordering of loads". Loads on ia32 are
in-order with respect to other loads and subsequent stores (by the
same processor). The only thing that differentiates PC from TSO is
the lack of remote write atomicity (in IA64 formal memory model
speak). Implementations (e.g. SPO) of course can do all sorts of
tricks to improve performance, but that doesn't change the memory
model. You're in denial.
|
Whatever. I'm going to use LFENCE for situations where I'd use
#LoadLoad on sparc (generic, not assuming TSO). And it's not
because I'm in denial. It's because nothing you say is
comprehensible. It's possible you are making some kind of
valid technical point but I have no way of telling.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software. |
|
Alexander Terekhov
Guest
|
Posted:
Sat Sep 03, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
Joe Seigh wrote:
[...]
| Quote: | Whatever. I'm going to use LFENCE for situations where I'd use
#LoadLoad on sparc (generic, not assuming TSO).
|
You mean RMO? Reportedly, RMO is vaporware, so yeah, you'll get the
same "useful" effect on Sparc as on ia32 (weakly ordered WC memory
aside for a moment): none whatsoever.
regards,
alexander. |
|
Alexander Terekhov
Guest
|
Posted:
Sat Sep 03, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
Joe Seigh wrote:
[...]
| Quote: | In the ia32 docs, it's a little muddied because of the mention of
speculative loads. None the less I had assumed that loads weren't
ordered and that LFENCE or some other memory barrier or serializing
instruction was needed for global ordering of loads.
|
Neither will give you "global ordering of loads". Loads on ia32 are
in-order with respect to other loads and subsequent stores (by the
same processor). The only thing that differentiates PC from TSO is
the lack of remote write atomicity (in IA64 formal memory model
speak). Implementations (e.g. SPO) of course can do all sorts of
tricks to improve performance, but that doesn't change the memory
model. You're in denial.
regards,
alexander. |
|
Eric P.
Guest
|
Posted:
Sat Sep 03, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
Joe Seigh wrote:
| Quote: |
Alexander Terekhov wrote:
Neither will give you "global ordering of loads". Loads on ia32 are
in-order with respect to other loads and subsequent stores (by the
same processor). The only thing that differentiates PC from TSO is
the lack of remote write atomicity (in IA64 formal memory model
speak). Implementations (e.g. SPO) of course can do all sorts of
tricks to improve performance, but that doesn't change the memory
model. You're in denial.
Whatever. I'm going to use LFENCE for situations where I'd use
#LoadLoad on sparc (generic, not assuming TSO). And it's not
because I'm in denial. It's because nothing you say is
comprehensible. It's possible you are making some kind of
valid technical point but I have no way of telling.
|
As I understand it, the key to causal ordering is Atomic Visibility
whereby a write becomes visible simultaneously to all processors
other than the one that issued the write. According to Gharachorloo,
Processor Consistency does not require updates be Atomically Visible
and, in theory, allows non-causal ordering of the kind in your
example. TSO does require Atomic Visibility.
The reason PC allows this rather dubious ordering appears to be so
as to not disallow caches using a Write Update (as opposed to Write
Invalidate) coherency protocol. Imposing Atomic Visibility on a
Write Update cache would be very difficult because each cache would
receive the updated value but then each would have to prevent that
value from being used until all peers had ack'ed. Imposing Atomic
Visibility on a Write Invalidate cache is much easier - just don't
give out the new value until all invalidate ack's are received.
(Others have pointed out, however, that Write Update caches are
undesirable for other reasons so PC appears to give up atomicity in
order to gain the ability to use a cache design that no one wants to use.
Go figure.)
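As a rough illustration of the "don't give out the new value until all invalidate acks are received" rule, here is a toy Python sketch. Everything in it is an assumption made for illustration: the class `InvalidateLine`, its methods, and the single-line, single-writer setup are hypothetical, and it is nowhere near a real coherence protocol (no line states, no bus, no directory).

```python
class InvalidateLine:
    """One cache line in a toy write-invalidate scheme with atomic visibility."""

    def __init__(self, value, sharers):
        self.visible = value        # value that remote readers may currently use
        self.pending = None         # new value held back while acks are outstanding
        self.has_pending = False
        self.writer = None
        self.awaiting = set()       # sharers that have not yet acked the invalidate
        self.sharers = set(sharers)

    def write(self, writer, new_value):
        """Start a write: invalidates go out, the new value is not released yet."""
        self.writer = writer
        self.pending = new_value
        self.has_pending = True
        self.awaiting = self.sharers - {writer}
        self._maybe_release()

    def ack(self, sharer):
        """A sharer confirms that it has dropped its old copy."""
        self.awaiting.discard(sharer)
        self._maybe_release()

    def read(self, reader):
        """Return the value `reader` may use, or None if it has to stall."""
        if self.has_pending:
            if reader == self.writer:
                return self.pending     # the writer can forward to itself
            if reader not in self.awaiting:
                return None             # copy invalidated, new value not released
        return self.visible             # old copy still valid, or nothing pending

    def _maybe_release(self):
        if self.has_pending and not self.awaiting:
            self.visible = self.pending
            self.has_pending = False


line = InvalidateLine(0, {"P1", "P2", "P3"})
line.write("P1", 1)
print(line.read("P2"))   # P2 has not acked yet, so it still reads its old copy
line.ack("P2")
print(line.read("P2"))   # invalidated but not released: P2 must stall
line.ack("P3")
print(line.read("P2"), line.read("P3"))
```

In this sketch no remote reader can obtain the new value while any other sharer still holds the old one, which is the atomicity property being discussed. A write-update design would have to add an equivalent hold-back step in every receiving cache.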
The text of LFENCE instruction in the Intel instruction manual says
"Performs a serializing operation on all load-from-memory instructions
that were issued prior the LFENCE instruction. This serializing
operation guarantees that every load instruction that precedes in
program order the LFENCE instruction is globally visible before any
load instruction that follows the LFENCE instruction is globally
visible. The LFENCE instruction is ordered with respect to load
instructions, other LFENCE instructions,"...
seems to provide the guarantees for global visibility and
therefore causality that you are looking for.
Eric |
|
Alexander Terekhov
Guest
|
Posted:
Sat Sep 03, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
"Eric P." wrote:
[...]
| Quote: | The text of LFENCE instruction in the Intel instruction manual says
"Performs a serializing operation on all load-from-memory instructions
that were issued prior the LFENCE instruction. This serializing
operation guarantees that every load instruction that precedes in
program order the LFENCE instruction is globally visible before any
load instruction that follows the LFENCE instruction is globally
visible. The LFENCE instruction is ordered with respect to load
instructions, other LFENCE instructions,"...
seems to provide the guarantees for global visibility and
|
What does "global visibility" mean for loads under PC?
| Quote: | therefore causality that you are looking for.
|
So where do you put the fence, then?
: processor 1 stores into X
: processor 2 sees the store by 1 into X and stores into Y
: processor 3 loads from Y
: processor 3 loads from X
regards,
alexander. |
|
Alexander Terekhov
Guest
|
Posted:
Sat Sep 03, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
Joe Seigh wrote:
[... filters ...]
< Forward Quoted >
Newsgroups: comp.programming.threads
Subject: Re: Memory visibility and MS Interlocked instructions
From: David Hopwood <david.nospam.hopwood@blueyonder.co.uk>
-------- Original Message --------
David Hopwood wrote:
| Quote: |
Alexander Terekhov wrote:
Andy Glew of Intel (sorta) confirmed that x86 is classic PC.
http://groups.google.de/group/comp.arch/msg/7200ec152c8cca0c
Joe Seigh wrote:
The argument being presented in c.p.t. is that processor consistency
implies loads are in order, perhaps instigated by something Andy Glew
said about this here
http://groups.google.com/group/comp.arch/msg/96ec4a9fb75389a2
and in another post:
| "loads in order" means #LoadLoad between loads.
AFAICT, this is not true for 3 or more processors. E.g.
processor 1 stores into X
processor 2 sees the store by 1 into X and stores into Y
So the store into Y occurred after causal reasoning.
Processor consistency is weaker than causal consistency, remember.
processor 3 loads from Y
processor 3 loads from X
If loads were in order you could infer that if processor 3
sees the new value of Y then it will see the new value of X.
No.
Start with X == Y == 0.
P1: X := 1
P2: t := X;
    if (t == 1) Y := 1
P3: u := Y
    #LoadLoad // or acquire
    v := X
{u == 1, v == 0} is possible. This is because P2 and P3 might see
the stores to X and Y in a different order, because they are made
by different processors. The #LoadLoad does not prevent this.
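For anyone who wants to check this mechanically, here is a minimal Python sketch that enumerates every schedule of the litmus test above in a toy model. The model is an assumption, not a description of any real x86 or IA64 implementation: each store is delivered to each remote processor as a separate event, and the `atomic_writes` flag (a hypothetical name) collapses the delivery of X into a single event to mimic remote write atomicity.

```python
from itertools import permutations


def outcomes(atomic_writes):
    """Return the set of (u, v) results P3 can observe in the litmus test."""
    x_events = ["X_to_all"] if atomic_writes else ["X_to_P2", "X_to_P3"]
    events = ["P1_st_X", *x_events, "P2_ld_X", "P2_st_Y", "Y_to_P3",
              "P3_ld_Y", "P3_ld_X"]
    # (earlier, later) pairs that every schedule must respect.
    before = [("P1_st_X", e) for e in x_events] + [
        ("P2_ld_X", "P2_st_Y"),   # P2 program order
        ("P2_st_Y", "Y_to_P3"),   # a store propagates only after it is issued
        ("P3_ld_Y", "P3_ld_X"),   # P3 program order, i.e. the #LoadLoad
    ]
    results = set()
    for order in permutations(events):
        pos = {e: i for i, e in enumerate(order)}
        if any(pos[a] > pos[b] for a, b in before):
            continue
        x_at = {"P2": 0, "P3": 0}   # each processor's current view of X
        y_at_p3 = 0
        t = u = v = 0
        y_issued = False
        for e in order:
            if e == "X_to_all":
                x_at["P2"] = x_at["P3"] = 1
            elif e == "X_to_P2":
                x_at["P2"] = 1
            elif e == "X_to_P3":
                x_at["P3"] = 1
            elif e == "P2_ld_X":
                t = x_at["P2"]
            elif e == "P2_st_Y":
                y_issued = (t == 1)
            elif e == "Y_to_P3":
                if y_issued:
                    y_at_p3 = 1
            elif e == "P3_ld_Y":
                u = y_at_p3
            elif e == "P3_ld_X":
                v = x_at["P3"]
        results.add((u, v))
    return results


print("separate deliveries:", sorted(outcomes(atomic_writes=False)))
print("atomic delivery:    ", sorted(outcomes(atomic_writes=True)))
```

In this toy model P3's two loads are always performed in program order, yet u == 1, v == 0 is still reachable when X is delivered to P2 and P3 separately. It disappears only when the delivery of X is made atomic.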
But the rules for processor consistency *clearly* state that
you will [not] necessarily see stores by different processors in
order.
While there are still ordering constraints on the loads they
don't have to be strictly in order as Andy incorrectly infers.
#LoadLoad between loads does not imply that you will necessarily
see stores by different processors in a single global order. That
is what you appear to be misunderstanding. In other words, there
is nothing inconsistent between what Andy Glew's post said, and
Alexander's assertion that load on x86 implies load.acq.
--
David Hopwood <david.nospam.hopwood@blueyonder.co.uk>
|
regards,
alexander. |
|
Eric P.
Guest
|
Posted:
Sat Sep 03, 2005 4:15 pm Post subject:
Re: Intel x86 memory model question |
|
|
Alexander Terekhov wrote:
| Quote: |
"Eric P." wrote:
[...]
The text of LFENCE instruction in the Intel instruction manual says
"Performs a serializing operation on all load-from-memory instructions
that were issued prior the LFENCE instruction. This serializing
operation guarantees that every load instruction that precedes in
program order the LFENCE instruction is globally visible before any
load instruction that follows the LFENCE instruction is globally
visible. The LFENCE instruction is ordered with respect to load
instructions, other LFENCE instructions,"...
seems to provide the guarantees for global visibility and
What does "global visibility" mean for loads under PC?
|
Point taken.
| Quote: | therefore causality that you are looking for.
So where do you put the fence, then?
: processor 1 stores into X
: processor 2 sees the store by 1 into X and stores into Y
: processor 3 loads from Y
: processor 3 loads from X
regards,
alexander.
|
I was wondering that myself. How about:
P3:
    LD X
    LFENCE
    LD Y
    LFENCE
    LD X
This does seem a terrible price to pay for the 'advantages' one
gets from giving up Atomic Visibility.
In practice I would be surprised if this could ever really occur.
When Joe posted the example, I thought it was impossible.
I was surprised to find that it was, in theory, possible, at
least according to the Gharachorloo definition of PC.
I would be more surprised if there was one programmer in a
million who did not consider this a hardware bug or wrote
code that took this into account. I'd bet people code to TSO.
Eric |
|