Intel x86 memory model question
Joe Seigh
Posted: Sat Sep 03, 2005 4:15 pm    Post subject: Re: Intel x86 memory model question

Alexander Terekhov wrote:
Quote:
So where do you put the fence, then?

: processor 1 stores into X
: processor 2 sees the store by 1 into X and stores into Y
: processor 3 loads from Y
: processor 3 loads from X


Since this was my example I should clarify. It was meant to
show that PC alone wasn't sufficient to guarantee that if processor
3 saw the store into Y by processor 2 that it would see the
store into X by processor 1.

My understanding of the ia32 memory model is that you
need a fence instruction between the loads by processor 3
and a fence between the load and store by processor 2 to
make the guarantee work.
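
For reference, the three-processor example can be written down as a small litmus test. What follows is only a sketch in C++ using std::atomic with acquire/release ordering; the names wrc_once and count_wrc_violations are made up for illustration. It exercises the C++ memory model, which forbids the outcome "new Y, stale X", and does not by itself settle what raw ia32 loads and stores guarantee.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical litmus harness for the three-processor example.
// Returns true if P3 observed the new Y together with a stale X.
bool wrc_once() {
    std::atomic<int> x{0}, y{0};
    int r_y = 0, r_x = 0;

    // P1 stores into X.
    std::thread p1([&] { x.store(1, std::memory_order_release); });
    // P2 waits until it sees the store by P1 into X, then stores into Y.
    std::thread p2([&] {
        while (x.load(std::memory_order_acquire) != 1) { }
        y.store(1, std::memory_order_release);
    });
    // P3 loads from Y, then loads from X.
    std::thread p3([&] {
        r_y = y.load(std::memory_order_acquire);
        r_x = x.load(std::memory_order_acquire);
    });

    p1.join();
    p2.join();
    p3.join();
    return r_y == 1 && r_x == 0;
}

// Repeats the experiment and counts how often causality was violated.
int count_wrc_violations(int iterations) {
    assert(iterations >= 0);
    int violations = 0;
    for (int i = 0; i < iterations; ++i)
        if (wrc_once()) ++violations;
    return violations;
}
```

Calling count_wrc_violations(1000) should return 0 under the C++ rules. Mainstream x86 compilers typically emit plain MOVs for these acquire loads and release stores, so a run on ia32 hardware is in effect also a (non-conclusive) probe of the hardware behaviour discussed in this thread.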


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Alexander Terekhov
Posted: Sat Sep 03, 2005 9:21 pm    Post subject: Re: Intel x86 memory model question

"Eric P." wrote:
[...]
Quote:
I was wondering that myself. How about:
P3:
LD X
LFENCE
LD Y
LFENCE
LD X

That won't change anything. For causality, you need to CAS X on P3.

Quote:

This does seem a terrible price to pay for the 'advantages' one
gets from giving up Atomic Visibility.

Power architecture also doesn't guarantee atomic visibility.

Here's full-modes (apart from dd/cc stuff) load intrinsic in pseudo
code for CELLs and XBOXes. ;-)

http://tinyurl.com/83r9b

Constraint calculator:

http://tinyurl.com/9vamz

regards,
alexander.

Alexander Terekhov
Posted: Sat Sep 03, 2005 9:33 pm    Post subject: Re: Intel x86 memory model question

Joe Seigh wrote:
[...]
Quote:
My understanding of the ia32 memory model is that you
need a fence instruction between the loads by processor 3

And what are you going to do on a (hypothetical) quad 486 (or
some other old ia32) box without SSE fences? ;-)

regards,
alexander.

Alexander Terekhov
Posted: Sat Sep 03, 2005 9:40 pm    Post subject: Re: Intel x86 memory model question

Joe Seigh wrote:
[...]
Quote:
My understanding of the ia32 memory model is that you
need a fence instruction between the loads by processor 3

LFENCE/#LoadLoad is implied by processor consistency.

Quote:
and a fence between the load and store by processor 2 to
make the guarantee work.

#LoadStore fence for P2 (load X ... store Y) is also implied by
processor consistency.

So what's the point?

regards,
alexander.

Eric P.
Posted: Sun Sep 04, 2005 12:15 am    Post subject: Re: Intel x86 memory model question

Alexander Terekhov wrote:
Quote:

"Eric P." wrote:
[...]
I was wondering that myself. How about:
P3:
LD X
LFENCE
LD Y
LFENCE
LD X

That won't change anything. For causality, you need to CAS X on P3.

Yeah. X could change again after the first fence. Silly me. :-)
I was trying to avoid the fact that the LFENCE definition does NOT
require all queued invalidates to be delivered before proceeding.
That might allow the update to X to remain outstanding.
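
To picture how the update to X could "remain outstanding", here is a toy, single-threaded model in C++. It is not a description of any real chip: the Machine type, its per-sender queues and the replay_example function are all invented for illustration. Each processor has a private view of memory, and updates from a given sender arrive in order, but updates from different senders may be delivered in any order.

```cpp
#include <cassert>
#include <deque>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Toy model, not real hardware: every processor keeps a private view of
// memory and one FIFO of undelivered updates per sending processor.
struct Machine {
    struct Update { std::string addr; int value; };
    int n;
    std::vector<std::map<std::string, int>> view;            // view[p][addr]
    std::vector<std::vector<std::deque<Update>>> in_flight;  // in_flight[to][from]

    explicit Machine(int procs)
        : n(procs), view(procs),
          in_flight(procs, std::vector<std::deque<Update>>(procs)) {}

    // A store is visible to its own processor at once and is queued,
    // in program order, for every other processor.
    void store(int p, const std::string& addr, int value) {
        assert(p >= 0 && p < n);
        view[p][addr] = value;
        for (int q = 0; q < n; ++q)
            if (q != p) in_flight[q][p].push_back({addr, value});
    }

    // Locations that were never written read as 0.
    int load(int p, const std::string& addr) const {
        auto it = view[p].find(addr);
        return it == view[p].end() ? 0 : it->second;
    }

    // Deliver everything processor `from` has sent to processor `to`.
    void deliver(int to, int from) {
        for (const Update& u : in_flight[to][from]) view[to][u.addr] = u.value;
        in_flight[to][from].clear();
    }
};

// Replays the example; returns {Y as seen by P3, X as seen by P3}.
std::pair<int, int> replay_example() {
    Machine m(3);                 // processors 0, 1, 2 play P1, P2, P3
    m.store(0, "X", 1);           // P1 stores into X
    m.deliver(1, 0);              // the update reaches P2 ...
    if (m.load(1, "X") == 1)      // ... P2 sees it ...
        m.store(1, "Y", 1);       // ... and stores into Y
    m.deliver(2, 1);              // P2's update reaches P3 first
    int y = m.load(2, "Y");
    int x = m.load(2, "X");       // P1's update to X is still in flight
    return {y, x};
}
```

In this model replay_example() returns {1, 0}: P3 has the new Y and the old X. A fence on P3 that only orders P3's own loads changes nothing here, because the missing ingredient is the delivery of P1's update.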

It would be simpler if they had used definitions like for the
Alpha Memory Barrier MB instruction:

"MB and CALL_PAL IMB force all preceding writes to at least reach
their respective coherency points. This does not mean that main-memory
writes have been done, just that the order of the eventual writes is
committed.

MB and CALL_PAL IMB also force all queued cache invalidates to be
delivered to the local caches before starting any subsequent reads
(that may otherwise cache hit on stale data) or writes (that may
otherwise write the cache, only to have the write effectively
overwritten by a late-delivered invalidate)."

Quote:
Power architecture also doesn't guarantee atomic visibility.

Not that it is relevant to the x86, but a PowerPC 750 manual that
I have from 1999 says

"3.3.5.1 Performed Loads and Stores
The PowerPC architecture defines a performed load operation as one
that has the addressed memory location bound to the target register
of the load instruction. The architecture defines a performed store
operation as one where the stored value is the value that any other
processor will receive when executing a load operation."

This would seem to indicate that, at least for that model,
it used atomic visibility. It still needs sync instructions
to prevent load & store reordering or bypassing.

Eric

Alexander Terekhov
Posted: Mon Sep 05, 2005 4:15 pm    Post subject: Re: Intel x86 memory model question

Andy Glew wrote:
[...]
Quote:
briefly stated: WB memory is processor consistent, type II.

Would you please confirm that in order to get SC semantics for x86 WB
memory, I just need to replace all loads by lock-cmpxchg with 42 in
accumulator and simply use resulting value in accumulator after cmpxchg
as load operation result... which would also provide store-load fencing
inside cmpxchg with respect to load from DEST?

TIA.

regards,
alexander.

Eric P.
Posted: Mon Sep 05, 2005 4:15 pm    Post subject: Re: Intel x86 memory model question

Alexander Terekhov wrote:
Quote:

"Eric P." wrote:
[...]
I was wondering that myself. How about:
P3:
LD X
LFENCE
LD Y
LFENCE
LD X

That won't change anything. For causality, you need to CAS X on P3.

Does the following basically reflect your reasoning:

Scenario:
processor 1 stores into X
processor 2 sees the store by 1 into X and stores into Y
processor 3 loads from Y
processor 3 loads from X

1) Processor Consistency intrinsically allows P3 to have a new
value for Y and a stale value for X. This can be accomplished,
for example, by allowing P1 to hand out new values for X to
some peers before ensuring all old values of X are invalid.

There may be an invalidate X winging its way from P1 to P3,
but there is no guarantee when it will arrive (other than that it
will do so before the next store by P1 arrives at P3).

2) SFENCE "guarantees that the results of every store instruction
that precedes the store fence in program order is globally visible
before any store instruction that follows the fence."
This is intended for use with weak ordered memory types.

The guarantee is that the value will be 'globally visible' at
some time in the future and before the next store, NOT that it
will be globally visible at the end of the SFENCE.

When used with normal, Processor Consistency and Write Back caching
memory this is exactly the same guarantee as PC provides, therefore
the SFENCE does nothing to change invalidate delivery.

3) LFENCE does not explicitly guarantee to drain all pending
invalidates for a processor. However even assuming that was
just a documentation oversight and that it really does drain them,
since there is no guarantee that P3 will have received its
invalidate, an LFENCE on P3 does not guarantee X is not stale.
P3 can still receive the new Y, LFENCE to drain the invalidates
and read the old X.

(I considered whether LFENCE might perform a 'global sync' by
communicating with all peers and ensure there were no outstanding
invalidates/updates in flight to itself before the drain in order
to ensure X was up to date. However I don't believe this would
work unless the global sync was itself atomic.)

4) The only way to guarantee that a processor has the most recent
value of a location is to take ownership of the variable,
and that requires a write. Since we actually want to read X,
we use CAS (x86 LOCK CMPXCHG) to read the most recent value.

So in the presence of Processor Consistency, with its lack of
Atomic Visibility, then the causally consistent sequence is:

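P3:
      LD  Y, r1        ; read Y
Loop:
      LD  X, r2        ; read the current value of X
      CAS X, r2, r2    ; write the same value back to take ownership of X
      BEZ Loop         ; retry if the CAS failed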
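
The same loop, sketched in C++ for anyone who wants to try it. This is an illustration under assumptions, not a statement of what the hardware guarantees: load_via_cas_loop is a made-up helper, and std::atomic's compare_exchange_strong stands in for CAS (mainstream x86 compilers typically turn it into LOCK CMPXCHG for an int).

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: read X by taking ownership of it, following the
// LD / CAS / branch loop. The CAS writes back the value it just read.
int load_via_cas_loop(std::atomic<int>& x) {
    int seen = x.load(std::memory_order_relaxed);   // LD X, r2
    for (;;) {
        int expected = seen;
        // CAS X, r2, r2: write `seen` back only if X still holds it.
        if (x.compare_exchange_strong(expected, seen))
            return seen;
        // CAS failed: `expected` now holds X's current value, so retry
        // with that instead of issuing a separate reload.
        seen = expected;
    }
}

// Single-threaded sanity check of the helper.
void demo_load_via_cas_loop() {
    std::atomic<int> x{7};
    assert(load_via_cas_loop(x) == 7);
    assert(x.load() == 7);   // the value of X is left unchanged
}
```

The value of X is never altered by the helper; only the fact of writing matters to the argument above.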

Eric

Alexander Terekhov
Posted: Mon Sep 05, 2005 4:15 pm    Post subject: Re: Intel x86 memory model question

"Eric P." wrote:
[...]
Quote:
Does the following basically reflect your reasoning:

[... 1 - 3 ...]

Yes.

Quote:
4) The only way to guarantee that a processor has the most recent
value of a location is to take ownership of the variable,
and that requires a write. Since we actually want to read X,
^^^^^^^^^^^^^^^^^^^^^^^^^


That's the key.

Quote:
we use CAS (x86 LOCK CMPXCHG) to read the most recent value.

So in the presence of Processor Consistency, with its lack of
Atomic Visibility, then the causally consistent sequence is:

P3:
LD Y, r1
Loop:
LD X, r2
CAS X, r2, r2
BEZ Loop

That will work too, but you don't really need to LD X and loop on
CAS compare failure given that x86's cmpxchg always makes a write.
"The destination operand is written back if the comparison fails;
otherwise, the source operand is written into the destination. (The
processor never produces a locked read without also producing a
locked write.)"

So just do cmpxchg(&X, 42, 42) which will perform locked read-write
(with its read part store-load fenced from prior writes, I infer).
You'll get classic SC if you replace all loads with cmpxchg(&X, 42,
42). That's my understanding, and I'm eagerly awaiting confirmation
from Andy Glew and/or someone from Intel hanging at C++ memory model
mailing list.

http://tinyurl.com/aqgjj
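
A rough C++ rendering of the cmpxchg(&X, 42, 42) idea, for concreteness. The helper name load_via_single_cas is invented, and std::atomic's compare_exchange_strong is assumed to map to LOCK CMPXCHG, which is what mainstream x86 compilers typically emit. Whether using it for every load really yields SC on x86 is exactly the question being asked, so treat this as a sketch of the mechanics only.

```cpp
#include <atomic>
#include <cassert>

// Hypothetical helper: one compare-and-swap with an arbitrary comparand.
// No loop is needed because the result is X's value either way.
int load_via_single_cas(std::atomic<int>& x) {
    int expected = 42;                        // arbitrary comparand
    x.compare_exchange_strong(expected, 42);  // seq_cst by default
    // If X was 42, it was replaced by 42 and `expected` is untouched.
    // Otherwise the CAS failed and `expected` now holds X's value.
    return expected;
}

// Single-threaded sanity check covering both the failing and the
// succeeding comparison.
void demo_load_via_single_cas() {
    std::atomic<int> a{7};
    assert(load_via_single_cas(a) == 7);
    assert(a.load() == 7);
    std::atomic<int> b{42};
    assert(load_via_single_cas(b) == 42);
    assert(b.load() == 42);
}
```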

regards,
alexander.

David Hopwood
Posted: Mon Sep 05, 2005 11:03 pm    Post subject: Re: Intel x86 memory model question

Eric P. wrote:
Quote:
Joe Seigh wrote:
Alexander Terekhov wrote:

Neither will give you "global ordering of loads". Loads on ia32 are
in-order with respect to other loads and subsequent stores (by the
same processor). The only thing that differentiates PC from TSO is
the lack of remote write atomicity (in IA64 formal memory model
speak). Implementations (e.g. SPO) of course can do all sorts of
tricks to improve performance, but that doesn't change the memory
model. You're in denial.

Whatever. I'm going to use LFENCE for situations where I'd use
#LoadLoad on sparc (generic, not assuming TSO). And it's not
because I'm in denial. It's because nothing you say is
comprehensible. It's possible you are making some kind of
valid technical point but I have no way of telling.

As I understand it, the key to causal ordering is Atomic Visibility
whereby a write becomes visible simultaneously to all processors
other than the one that issued the write. According to Gharachorloo,
Processor Consistency does not require updates be Atomically Visible
and, in theory, allows non-causal ordering of the kind in your
example. TSO does require Atomic Visibility.

Right.

[...]
Quote:
The text of LFENCE instruction in the Intel instruction manual says
"Performs a serializing operation on all load-from-memory instructions
that were issued prior the LFENCE instruction. This serializing
operation guarantees that every load instruction that precedes in
program order the LFENCE instruction is globally visible before any
load instruction that follows the LFENCE instruction is globally
visible. The LFENCE instruction is ordered with respect to load
instructions, other LFENCE instructions,"...

seems to provide the guarantees for global visibility and
therefore causality that you are looking for.

It's not entirely clear what "globally visible" in the Intel manual
is supposed to mean in the terminology of
<http://research.compaq.com/wrl/people/kourosh/papers/1993-tr-68.pdf>,
but I think it means just "performed" (with respect to all processors),
*not* "globally performed".

--
David Hopwood <david.nospam.hopwood@blueyonder.co.uk>

David Hopwood
Posted: Mon Sep 05, 2005 11:14 pm    Post subject: Re: Intel x86 memory model question

Joe Seigh wrote:
Quote:
Alexander Terekhov wrote:

So where do you put the fence, then?

: processor 1 stores into X
: processor 2 sees the store by 1 into X and stores into Y
: processor 3 loads from Y
: processor 3 loads from X

Since this was my example I should clarify. It was meant to
show that PC alone wasn't sufficient to guarantee that if processor
3 saw the store into Y by processor 2 that it would see the
store into X by processor 1.

My understanding of the ia32 memory model is that you
need a fence instruction between the loads by processor 3
and a fence between the load and store by processor 2 to
make the guarantee work.

My understanding is that if the claimed problem exists at all, adding
these fences won't fix it (as far as the model is concerned, possibly
as opposed to implementation details of specific chips).

--
David Hopwood <david.nospam.hopwood@blueyonder.co.uk>

Alexander Terekhov
Posted: Mon Sep 05, 2005 11:27 pm    Post subject: Re: Intel x86 memory model question

David Hopwood wrote:

[... SSE2 LFENCE ...]

Quote:
It's not entirely clear what "globally visible" in the Intel manual

It's just copy&paste leftover from SSE1 SFENCE description.

regards,
alexander.

Joe Seigh
Posted: Tue Sep 06, 2005 12:15 am    Post subject: Re: Intel x86 memory model question

David Hopwood wrote:
Quote:
Joe Seigh wrote:

Alexander Terekhov wrote:

So where do you put the fence, then?

: processor 1 stores into X
: processor 2 sees the store by 1 into X and stores into Y
: processor 3 loads from Y
: processor 3 loads from X


Since this was my example I should clarify. It was meant to
show that PC alone wasn't sufficient to guarantee that if processor
3 saw the store into Y by processor 2 that it would see the
store into X by processor 1.

My understanding of the ia32 memory model is that you
need a fence instruction between the loads by processor 3
and a fence between the load and store by processor 2 to
make the guarantee work.


My understanding is that if the claimed problem exists at all, adding
these fences won't fix it (as far as the model is concerned, possibly
as opposed to implementation details of specific chips).


The architected memory model as opposed to the implemented one?

"Despite the fact that Pentium 4, Intel Xeon, and P6 family
processors support processor ordering, Intel does not guarantee that
future processors will support this model. To make software portable
to future processors, it is recommended that operating systems provide
critical region and resource control constructs and API’s (application
program interfaces) based on I/O, locking, and/or serializing
instructions be used to synchronize access to shared areas of
memory in multiple-processor systems."

That one? And what do people think the memory model that only
"I/O, locking, and/or serializing instructions" can synchronize is?

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

David Hopwood
Posted: Tue Sep 06, 2005 12:15 am    Post subject: Re: Intel x86 memory model question

Joe Seigh wrote:
Quote:
David Hopwood wrote:
Joe Seigh wrote:
Alexander Terekhov wrote:

So where do you put the fence, then?

: processor 1 stores into X
: processor 2 sees the store by 1 into X and stores into Y
: processor 3 loads from Y
: processor 3 loads from X

Since this was my example I should clarify. It was meant to
show that PC alone wasn't sufficient to guarantee that if processor
3 saw the store into Y by processor 2 that it would see the
store into X by processor 1.

My understanding of the ia32 memory model is that you
need a fence instruction between the loads by processor 3
and a fence between the load and store by processor 2 to
make the guarantee work.

My understanding is that if the claimed problem exists at all, adding
these fences won't fix it (as far as the model is concerned, possibly
as opposed to implementation details of specific chips).

The architected memory model as opposed to the implemented one?

Yes, that's what I said.

Quote:
"Despite the fact that Pentium 4, Intel Xeon, and P6 family
processors support processor ordering, Intel does not guarantee that
future processors will support this model. To make software portable
to future processors, it is recommended that operating systems provide
critical region and resource control constructs and API’s (application
program interfaces) based on I/O, locking, and/or serializing
instructions be used to synchronize access to shared areas of
memory in multiple-processor systems."

This is all perfectly sensible. "Future processors" from Intel are not
necessarily ISA-compatible with x86 anyway. For example, you need to
recompile to use long mode in EM64T. Also note that it doesn't say
"future x86 processors". Maybe they were talking about Itanic.

Even if they weren't talking about IA-64 or a different mode, it's
still a good idea to avoid dependencies on the memory model in
*applications*, since it is more difficult to change all apps that
have such dependencies than it is to change threading libraries in OS
and language implementations. In fact OS/lang-impl maintainers half
expect stuff to rot on new hardware, and hopefully remember what they
depended on. Application maintainers generally don't (if they ever
understood it in the first place). This is what I've been saying
consistently.

Anyway, this issue doesn't have anything to do with what we were talking
about, which is whether the current architected x86 model allows a
particular behaviour.

Quote:
That one? And what do people think the memory model that only
"I/O, locking, and/or serializing instructions" can synchronize is?

You're overanalysing a fairly loosely worded recommendation.

--
David Hopwood <david.nospam.hopwood@blueyonder.co.uk>

Joe Seigh
Posted: Tue Sep 06, 2005 12:15 am    Post subject: Re: Intel x86 memory model question

David Hopwood wrote:
Quote:
Joe Seigh wrote:

"Despite the fact that Pentium 4, Intel Xeon, and P6 family
processors support processor ordering, Intel does not guarantee that
future processors will support this model. To make software portable
to future processors, it is recommended that operating systems provide
critical region and resource control constructs and API’s (application
program interfaces) based on I/O, locking, and/or serializing
instructions be used to synchronize access to shared areas of
memory in multiple-processor systems."


This is all perfectly sensible. "Future processors" from Intel are not
necessarily ISA-compatible with x86 anyway. For example, you need to
recompile to use long mode in EM64T. Also note that it doesn't say
"future x86 processors". Maybe they were talking about Itanic.

Even if they weren't talking about IA-64 or a different mode, it's
still a good idea to avoid dependencies on the memory model in
*applications*, since it is more difficult to change all apps that
have such dependencies than it is to change threading libraries in OS
and language implementations. In fact OS/lang-impl maintainers half
expect stuff to rot on new hardware, and hopefully remember what they
depended on. Application maintainers generally don't (if they ever
understood it in the first place). This is what I've been saying
consistently.

Yes, your aversion to anarchist application programmers doing their
own thing is well known. :)

Quote:

Anyway, this issue doesn't have anything to do with what we were talking
about, which is whether the current architected x86 model allows a
particular behaviour.

That one? And what do people think the memory model that only
"I/O, locking, and/or serializing instructions" can synchronize is?


You're overanalysing a fairly loosely worded recommendation.


I'm not sure what you're saying here. That all future processors
from Intel that don't have processor ordering won't be x86? And
that the synchronization instructions in these future processors
won't be similar to the ones in x86? That Intel is telling people
in an x86 manual to start writing portable code not now but when
they get to the future processor? That's a little strange even for
Intel.


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

Andy Glew
Posted: Tue Sep 06, 2005 5:22 am    Post subject: Re: Intel x86 memory model question

Alexander Terekhov <terekhov@web.de> writes:

Quote:
So just do cmpxchg(&X, 42, 42) which will perform locked read-write
(with its read part store-load fenced from prior writes, I infer).
You'll get classic SC if you replace all loads with cmpxchg(&X, 42,
42). That's my understanding, and I'm eagerly awaiting confirmation
from Andy Glew and/or someone from Intel hanging at C++ memory model
mailing list.

42, eh? Sounds like a joke: Goodbye, and thanks for all the thrash...

I think that the overall intention is that placing MFENCE before and
after every memory reference is supposed to get you SC semantics.
However, MFENCE, LFENCE, and SFENCE were defined after my time, and I
suspect that their definitions are not quite complete enough for what
you want. In particular, *FENCE really only work wrt WB cacheable
memory, and do not drain external buffers such as may occur in bus
bridges. In general, the P6 and Wmt families' mechanism for ensuring
ordering, waiting for global observability, only works for perfectly
vanilla WB cacheable memory, and is frequently violated wrt other
memory types. So I do not want to guarantee that it will work for
things like WC cached memory that is private to a graphics
accelerator.

You may be right that using the cmpxchg as you describe achieves SC on
x86. However, I need to think about it a bit more, since the
reasoning you provide is implementation specific, not architectural.

(Note that an atomic RMW like cmpxchg could well be implemented
without any fencing semantics. I.e. atomic RMWs and memory
ordering/fencing are independent concepts. I argued for this in
Itanium; I am trying to remember if x86 required that the two be mixed
up together. I can't see why it should have... I.e. I am sure that
using cmpxchg as you describe need not provide SC on a reasonable
computer architecture. I just need to find out if x86 mixed the two up
for some legacy reasons. In the meantime: use the fences would be my
recommendation.)
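
A minimal sketch of the "fence before and after every memory reference" approach, written in C++ so it can be compiled and tried. Everything here is an assumption for illustration: fenced_load, fenced_store and the small store-buffering check are made-up names, and std::atomic_thread_fence(seq_cst) merely stands in for MFENCE (what instruction the compiler actually emits is up to the compiler). It relies only on the C++ rules for seq_cst fences and says nothing about the non-vanilla memory types mentioned above.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical wrappers: a full fence before and after every access.
// std::atomic_thread_fence(seq_cst) stands in for MFENCE here.
inline int fenced_load(const std::atomic<int>& a) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    int v = a.load(std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
    return v;
}

inline void fenced_store(std::atomic<int>& a, int v) {
    std::atomic_thread_fence(std::memory_order_seq_cst);
    a.store(v, std::memory_order_relaxed);
    std::atomic_thread_fence(std::memory_order_seq_cst);
}

// Store-buffering check: each thread stores its own flag, then loads the
// other's. Returns true if both loads returned 0, which SC forbids.
bool both_saw_zero_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = -1, r2 = -1;
    std::thread t1([&] { fenced_store(x, 1); r1 = fenced_load(y); });
    std::thread t2([&] { fenced_store(y, 1); r2 = fenced_load(x); });
    t1.join();
    t2.join();
    return r1 == 0 && r2 == 0;
}

// Repeats the check and counts how often the forbidden outcome appeared.
int count_both_zero(int iterations) {
    assert(iterations >= 0);
    int hits = 0;
    for (int i = 0; i < iterations; ++i)
        if (both_saw_zero_once()) ++hits;
    return hits;
}
```

With the fences in place count_both_zero(1000) should return 0; removing the fences between each store and the following load makes the "both zero" outcome permissible, which is the store-load reordering that processor ordering allows.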


Quote:
4) The only way to guarantee that a processor has the most recent
value of a location is to take ownership of the variable,
and that requires a write. Since we actually want to read X,
^^^^^^^^^^^^^^^^^^^^^^^^^

That's the key.

we use CAS (x86 LOCK CMPXCHG) to read the most recent value.

Flawed argument.

It is entirely possible to imagine implementations of CAS that do not
write the variable if the value is unchanged.

Quote:
That will work too, but you don't really need to LD X and loop on
CAS compare failure given that x86's cmpxchg always makes a write.
"The destination operand is written back if the comparison fails;
otherwise, the source operand is written into the destination. (The
processor never produces a locked read without also producing a
locked write.)"

You are confusing implementation with semantics.
All times are GMT.