CAS and LL/SC (was Re: High Level Assembler for MVS & VM & V
CASTalk.com Forum Index CASTalk.com
Discussion of DSP, FPGA, storage and embedded system.
 
 FAQFAQ   MemberlistMemberlist     RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 
 
Google
 
Web castalk.com
CAS and LL/SC (was Re: High Level Assembler for MVS & VM & V
Goto page Previous  1, 2, 3, ... 24, 25, 26  Next
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture
Author Message
John Mashey
Guest





Posted: Thu Dec 16, 2004 4:40 am    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

Actually [now that my posts aren't disappearing any more :-)], let me
review what *really* was going on in 1985 with R2000.

1) We didn't really expect people to be using Lamport's algorithm,
except as a last resort.

2) UNIXes of the time had semaphores via semget(), etc, which were
syscalls.

3) For user-level code,I was unable to find any consistent expectation
across the industry of:
a) Specific hardware primitives for synchronization
b) User-level APIs to access these
and in fact, computer systems vendors, especially those doing bus-based
multiprocessors, often denigrated the usual CPU instructions in favor
of special extra hardware for synchronizatiion, like the Sequent SLIC
bus. In particular, SMP folks were horrified at instructions that did
complete bus-locks.

4) At that point, UNIX kernel code had not generally been reworked for
fine-grain parallelism on SMPs, and certainly, uniprocessrs could quite
happily do efficient mutual exclusion via set-priority-level to turn
off interrupts.

5) There simply didn't exist a lot of user-level code (on UNIX, anyway)
that made large use of lightweight threads, or that did automatic
parallelization and expected really efficient user-level
synchronization operations.

6) Hence, I didn't feel at all bad about leaving the synch ops out,
since I couldn't tell the chip folks anything I could guarantee I'd be
happy with for many years [and once it's there, it's there]. However,
I'd had in the back of my mind that we should do a set of functions of
the C library with the common mechanisms, that turned into fast kernel
calls, so that code would get written to a standard API (for us at
least), and that later, we'd slide some new instructions underneath the
API .... a good plan, but in the terrible crush of other things to do,
I forgot about it.

Unlike some other areas, I regard this (synchronization operations) as
one of several areas where I'm still not comfortable that the industry
has produced many mechanisms that simultaneously have:
- consistent APIs, especially across vendors
- simple, cheap low-end implementations
- good scaling on large machines
Back to top
Sylvek
Guest





Posted: Thu Dec 16, 2004 12:02 pm    Post subject: ulocks.h (was: Re: CAS and LL/SC) Reply with quote

Hi John!

"John Mashey" <old_systems_guy@yahoo.com> wrote:
Quote:
I'd had in the back of my mind that we should do a set of functions of
the C library with the common mechanisms, that turned into fast kernel
calls, so that code would get written to a standard API (for us at
least), and that later, we'd slide some new instructions underneath the
API .... a good plan, but in the terrible crush of other things to do,
I forgot about it.

How about <ulocks.h>? I think you are describing just that API.
I even went to http://techpubs.sgi.com and verified that I'm not
confabulating. In the IRIX 5.3 archive there are man pages
for usinit(3P) and hl(7M). hl(7m) was actual hardware implemetation
of spinlocks on the (processor or I/O) boards of SGI machines.

I do remember that there were 4 or 3 code paths in that library:
1) >= R4000 uni-p
2) <= R3000 uni-p
3) >= R4000 multi-p (used LL/SC)
4) <= R3000 multi-p (used /dev/hl/hlock???)

I don't recall whether path (1) and (2) was the same or there
were two processor-specific ones. But I do remember that even
under IRIX 4.0.5 with statically linked COFF executables there
was similar taxonomy of synchronization primitives. The paths
were selected at runtime during the initialization of C library.

Personally, I think that <ulocks.h> is the least weird collection
of synchronization primitives that I have ever seen in a commercial
product (as opposed to the academia-only product). Certainly among
the ones supported on IRIX <ulocks.h> was THE set to use. Least
overhead and least buggy. I seem to recall only one patch for
usinit(3p) stuff, whereas there was a plentitude of pthreads
patches.

Or maybe I'm mixing you up with John McCalpin? If so, then please
accept my apologies.

Could you shed some light on my conundrum? Or maybe someone who
has access to SGI box can send the <ulocks.h> header to Nick
Maclaren for a quick critique? I certainly think that it would
be both fun and productive to write a couple of posts about
this API.

Regards,

Sylvester
Back to top
Jan Vorbrüggen
Guest





Posted: Thu Dec 16, 2004 1:39 pm    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

Quote:
This doesn't make sense to me
- unless you stop all other threads between LL/SC and/or mask interrupt,
the semantics can not be preserved.
How do you guarantee there will be no intervening store
to the same location from other processor ?

You don't - neither does the hardware SC instruction. Remember what the
C stands for?

All you need is to emulate the hardware's behaviour with respect to interrupts
and context switches. That's where some of the performance loss derives from,
I dare say.

Jan
Back to top
Jan Vorbrüggen
Guest





Posted: Thu Dec 16, 2004 1:48 pm    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

Quote:
4) At that point, UNIX kernel code had not generally been reworked for
fine-grain parallelism on SMPs, and certainly, uniprocessrs could quite
happily do efficient mutual exclusion via set-priority-level to turn
off interrupts.

5) There simply didn't exist a lot of user-level code (on UNIX, anyway)
that made large use of lightweight threads, or that did automatic
parallelization and expected really efficient user-level
synchronization operations.

Both the kernel and some important user-mode applications (e.g., BACKUP)
were heavily multi-threaded on VMS at the time, and I believe around 1985
it started to support SMP systems. There also was a library to support
parallel applications across processes, with primitives to map shared
memory areas and to control access to them. The kernel continued to use
synch primitives that the architecture guaranteed to behave properly
based on single instructions (INS/REMQUHI) or on a single processor, as
appropriate, or used fine-grained spinlocks otherwise.

Jan
Back to top
Terje Mathisen
Guest





Posted: Thu Dec 16, 2004 3:00 pm    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

Joe Seigh wrote:
Quote:

John Mashey wrote:

Unlike some other areas, I regard this (synchronization operations) as
one of several areas where I'm still not comfortable that the industry
has produced many mechanisms that simultaneously have:
- consistent APIs, especially across vendors
- simple, cheap low-end implementations
- good scaling on large machines

Except that the corresponding synchronization technique has to scale as well.
Having a synchronization primative for mutual exclution that scales well won't
do you any good because mutual exclusion won't scale up.

There are some techniques such as certain lock-free algorithms that do scale
well. But you're not going see hardware changed to take advantage of those
techniques because in general hardware types don't listen to software types
as obviously we don't know what we're doing.

Intel added XADD very late in the x86 life, afaik this must have been
specifically because it's a very useful primitive for lock-free
algorithms with guaranteed forward progress.

I remember a few years back we discussed how you could implement various
data structures using nothing but XADD. It seems like it should be quite
easy, but as usual, 'the devil is in the details'.

My best effort at the time required something like 4 or 6 XADD-updated
variables for a FIFO queue, and you still needed to pre-allocate the
maximum size of the queue in the form of an array of pointers.

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
Terje Mathisen
Guest





Posted: Thu Dec 16, 2004 3:09 pm    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

Joe Seigh wrote:
Quote:
Terje Mathisen wrote:
AFAIR, Lamport's algorithm depends on having unlimited range counters!

This does not work for unlimited uptimes, but might be good enough,
particularly on 64-bit machines.


Peterson's algorithm fixes that. It's also a distributed algorithm like
Lamport's.

What about the required memory barriers?

It seems like all these algorithms (Lamport, Dekker, Peterson,
Bakery...) starts with the assumption that all memory updates are
globally consistent, i.e. that all memory operations will be globally
visible in program order, right?

Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
Nick Maclaren
Guest





Posted: Thu Dec 16, 2004 3:29 pm    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

In article <1103156006.597207.44180@z14g2000cwz.googlegroups.com>,
John Mashey <old_systems_guy@yahoo.com> wrote:
Quote:
Actually [now that my posts aren't disappearing any more :-)], let me
review what *really* was going on in 1985 with R2000.

It wasn't just MIPS, and it was very much a rerun of mainframes'
approaches to the same issues. But it, eventually, got further.

Quote:
4) At that point, UNIX kernel code had not generally been reworked for
fine-grain parallelism on SMPs, and certainly, uniprocessrs could quite
happily do efficient mutual exclusion via set-priority-level to turn
off interrupts.

It still hasn't. Any site that pushes the limits (either in number of
SMP CPUs or in types of use of parallelism) is likely to hit parts of
the kernel that don't support that. But, now, perhaps 90% of shared
memory workloads run tolerably well, where it was more like 5% then.

Quote:
5) There simply didn't exist a lot of user-level code (on UNIX, anyway)
that made large use of lightweight threads, or that did automatic
parallelization and expected really efficient user-level
synchronization operations.

There still isn't. Not least because such primitives had a brief
existence in specialist systems in the late 1970s and early 1980s,
but had to take second place to the ability to run a general purpose
Unix system in the late 1980s. I.e. the current state is that there
are enough primitives on some systems, but only subject to one
condition:

The application and system configuration must be configured and
tuned for each other.

Inter alia, this prevents genuine portability and generally means that
privilege is needed to control some critical aspects.

Quote:
Unlike some other areas, I regard this (synchronization operations) as
one of several areas where I'm still not comfortable that the industry
has produced many mechanisms that simultaneously have:
- consistent APIs, especially across vendors
- simple, cheap low-end implementations
- good scaling on large machines

I know of nothing that achieves even one of those.


Regards,
Nick Maclaren.
Back to top
Joe Seigh
Guest





Posted: Thu Dec 16, 2004 5:16 pm    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

Terje Mathisen wrote:
Quote:

Joe Seigh wrote:

There are some techniques such as certain lock-free algorithms that do scale
well. But you're not going see hardware changed to take advantage of those
techniques because in general hardware types don't listen to software types
as obviously we don't know what we're doing.

Intel added XADD very late in the x86 life, afaik this must have been
specifically because it's a very useful primitive for lock-free
algorithms with guaranteed forward progress.

I remember a few years back we discussed how you could implement various
data structures using nothing but XADD. It seems like it should be quite
easy, but as usual, 'the devil is in the details'.

My best effort at the time required something like 4 or 6 XADD-updated
variables for a FIFO queue, and you still needed to pre-allocate the
maximum size of the queue in the form of an array of pointers.


Fetch-and-op primatives are nice. Intel has the misnamed fetchadd in their
Itanium architecture (it should be named fetchinc).

I'm thinking of other things. We have synchronization techniques that are
in theory more cache friendly since they'll tolerate stale cache. But there's
no way to exploit that in current hardware. That's on the the reasons I think
shared memory in hardware will go away and we'll go to shared memory in software
where we can implement the needed functionality. That seems to be the trend anyway.

BTW, I came up with a lock-free 63 bit counter (from 32 bit atomic ops) which
I posted on comp.lang.asm.x86. Sort of a novelty item since I don't really
need it. The bizarro techique I derived it from is more efficient for my
purposes. But I also figured out how to make your 64 bit atomic counter from
a previous discussion in that group work for multiple writers. It also
avoids the need for a store memory barrrier though it's not likely you will
see IA-32 without ordered stores.

Joe Seigh
Back to top
Joe Seigh
Guest





Posted: Thu Dec 16, 2004 5:20 pm    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

Terje Mathisen wrote:
Quote:

Joe Seigh wrote:
Terje Mathisen wrote:
AFAIR, Lamport's algorithm depends on having unlimited range counters!

This does not work for unlimited uptimes, but might be good enough,
particularly on 64-bit machines.


Peterson's algorithm fixes that. It's also a distributed algorithm like
Lamport's.

What about the required memory barriers?

It seems like all these algorithms (Lamport, Dekker, Peterson,
Bakery...) starts with the assumption that all memory updates are
globally consistent, i.e. that all memory operations will be globally
visible in program order, right?


Correct. Plus they require the reads and writes to be atomic. But that's
a much simpler set of primatives and thus more likely to exist.

Joe Seigh
Back to top
Joe Seigh
Guest





Posted: Thu Dec 16, 2004 5:46 pm    Post subject: Re: ulocks.h (was: Re: CAS and LL/SC) Reply with quote

Sylvek wrote:

Quote:

Could you shed some light on my conundrum? Or maybe someone who
has access to SGI box can send the <ulocks.h> header to Nick
Maclaren for a quick critique? I certainly think that it would
be both fun and productive to write a couple of posts about
this API.


Fun maybe but certainly not productive since no one else will have any idea
what is in the api.

Joe Seigh
Back to top
Nick Maclaren
Guest





Posted: Thu Dec 16, 2004 5:51 pm    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

In article <41C17DFE.8E7D853A@xemaps.com>,
Joe Seigh <jseigh_01@xemaps.com> wrote:
Quote:

Fetch-and-op primatives are nice. Intel has the misnamed fetchadd in their
Itanium architecture (it should be named fetchinc).

Do you mean atomic fetch-operate-and-store? While those are very
useful (e.g. they are immune to contention-induced livelock), they
need some support from the memory system.

Quote:
I'm thinking of other things. We have synchronization techniques that are
in theory more cache friendly since they'll tolerate stale cache. But there's
no way to exploit that in current hardware. That's on the the reasons I think
shared memory in hardware will go away and we'll go to shared memory in software
where we can implement the needed functionality. That seems to be the trend anyway.

Eh? I don't understand what you mean.

Firstly, one problem with many existing techniques is that make implicit
assumptions about the threads either being on different CPUs or being
on the same CPU. Yet software is provided with no way of controlling
such things at the thread level. Similar remarks can be made about
primitives like 'suspend me until thread X has had a time slice';
POSIX allows some control, but not enough to ensure success (e.g.
I can find nothing that requires a useful minimum time slice length).

But, more seriously, I can't make head or tail of your last two sentences.
Software-implemented shared memory has been attempted many times, and has
invariably been a complete disaster - at the VERY least, all current
designs make the hardware responsible for the coherence (with some uses
of VERY high-priority interrupts and firmware).


Regards,
Nick Maclaren.
Back to top
Nick Maclaren
Guest





Posted: Thu Dec 16, 2004 5:59 pm    Post subject: Re: ulocks.h (was: Re: CAS and LL/SC) Reply with quote

In article <fqawd.21552$1%.13886@fed1read03>, Sylvek <__THINK__@cox.net> wrote:
Quote:

Could you shed some light on my conundrum? Or maybe someone who
has access to SGI box can send the <ulocks.h> header to Nick
Maclaren for a quick critique? I certainly think that it would
be both fun and productive to write a couple of posts about
this API.

I no longer manage an SGI system with more than one processor.
It is quite possible that the facility is one of the best around
but, from the dates in /usr/include/ulocks.h, it is one of the many
designs that did not make it into POSIX. I know no more about it.


Regards,
Nick Maclaren.
Back to top
Joe Seigh
Guest





Posted: Thu Dec 16, 2004 6:49 pm    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

Nick Maclaren wrote:
Quote:

In article <41C17DFE.8E7D853A@xemaps.com>,
Joe Seigh <jseigh_01@xemaps.com> wrote:

Fetch-and-op primatives are nice. Intel has the misnamed fetchadd in their
Itanium architecture (it should be named fetchinc).

Do you mean atomic fetch-operate-and-store? While those are very
useful (e.g. they are immune to contention-induced livelock), they
need some support from the memory system.

I didn't coin the term. I believe it's just a summary of the names that were
used, e.g. fetch-and-add, fetch-and-increment, fetch-and-set, fetch-and-or, etc...
The Itanium fetchadd looks like it could operate either way, either by locking
cache lines or via the memory subsystem.

Quote:

I'm thinking of other things. We have synchronization techniques that are
in theory more cache friendly since they'll tolerate stale cache. But there's
no way to exploit that in current hardware. That's on the the reasons I think
shared memory in hardware will go away and we'll go to shared memory in software
where we can implement the needed functionality. That seems to be the trend anyway.

Eh? I don't understand what you mean.

Firstly, one problem with many existing techniques is that make implicit
assumptions about the threads either being on different CPUs or being
on the same CPU. Yet software is provided with no way of controlling
such things at the thread level. Similar remarks can be made about
primitives like 'suspend me until thread X has had a time slice';
POSIX allows some control, but not enough to ensure success (e.g.
I can find nothing that requires a useful minimum time slice length).

I generally avoid using techniques that have those problems. Lock-free techniques
don't require forward progress by other threads so whether they're on the same
cpu or some other cpu doesn't make any difference.

Quote:

But, more seriously, I can't make head or tail of your last two sentences.
Software-implemented shared memory has been attempted many times, and has
invariably been a complete disaster - at the VERY least, all current
designs make the hardware responsible for the coherence (with some uses
of VERY high-priority interrupts and firmware).

Just what I think the trend is. Obviously if software implemented shared memory
has been a wild sucess instead of a disaster, it couldn't be a trend. We'd be
there already. Bottom line, probability of sucessfully implementing software
shared memory 0.00 ... (as many zeros as you want) ... 01. Probability of
hardware designers listening to suggestions from software types, 0.0.

Joe Seigh
Back to top
Maciej W. Rozycki
Guest





Posted: Thu Dec 16, 2004 7:04 pm    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

On Wed, 15 Dec 2004, Nick Maclaren wrote:

Quote:
Interesting. The lack of such instructions effectively means that
shared-memory, parallel-thread applications are unsupported, but that
was not a serious issue then. Nor was the scalability impact on the
kernel.

Note that as long as you go UP this can be handled at the OS level. For
example Linux handles user mode RI traps on LL and SC when run on MIPS-I
processors and emulates the instructions. There's a considerable
performance loss, of course, but the semantics of these operations is
preserved.

This doesn't make sense to me
- unless you stop all other threads between LL/SC and/or mask interrupt,
the semantics can not be preserved.
How do you guarantee there will be no intervening store
to the same location from other processor ?

That's why we do it for uniprocessor (UP) only. Having no MIPS-I
multiprocessor hardware available we've not faced yet the problem of
implementing atomic operations there. Given the rarity of such systems
it's likely not going to happen.

Of course you've used the term "parallel-thread applications" above which
implies two CPUs at least, and which I've unfortunately missed upon my
first read -- sorry.

Quote:
This confusion is my fault - I didn't explain in enough detail, and
Maciej Rozyck has assumed a different sense of supported to you and me.

There is a level at which semantics is preserved, but there is another
in which it cannot be. For example, there are (correct) codes which
will always terminate if run on a uniprocessor (with plausibly coarse
scheduling) but not necessarily do so if each thread has its own CPU,
and there are (correct) codes which do the converse.

For many OSes (including Linux), you can't guarantee any particular
timing of execution anyway as it may depend on external events like
processing of hardware interrupts which "steal" time from user processes.
The latency effect of emulating LL and SC is not much different from
taking a hardware interrupt between LL and SC (for the SC failing case) or
either immediately before LL or immediately after SC (for the SC
succeeding case). And you can have interrupts that happen at
deterministic intervals.

Certainly you are correct these factors need to be taken into account
when implementing algorithms that are meant to work on real equipment.

Quote:
No competent software engineer would ever analyse an algorithm for
best-case operation alone. Exactly what you do analyse will depend on
the requirement, and "average case under heavy contention" is a very
common target. So is "worst plausible case".

Indeed. And with LL/SC you actually risk a livelock if you don't handle
contention properly.

Maciej
Back to top
Joe Seigh
Guest





Posted: Thu Dec 16, 2004 7:16 pm    Post subject: Re: CAS and LL/SC (was Re: High Level Assembler for MVS & VM Reply with quote

"Maciej W. Rozycki" wrote:
Quote:

Indeed. And with LL/SC you actually risk a livelock if you don't handle
contention properly.


I believe Herlihy took care of that with his definition of obstruction-free
which fixes the problem with using the term lock-free in conjuction with
some of the hardware synchronization primitives since someone will always
raise the point that there's some contention level at which they won't work.

Joe Seigh
Back to top
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture All times are GMT
Goto page Previous  1, 2, 3, ... 24, 25, 26  Next
Page 2 of 26

 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum




VoIP Electronics Powered by phpBB