Thread Stacks

General discussion of computer architecture.

Re: Thread Stacks

Postby Andy Glew » Sun Dec 04, 2005 1:15 am

Ian Rogers <ian.rogers@manchester.ac.uk> writes:
Linux was using fs selector to address thread local storage (TLS) for
a while, but now it does it through the paging system, which has many
advantages.

And disadvantages... In particular, that the threads are no longer
"completely shaed address space". Making it harder to share the page
tables and TLB entries and TLB ASIDs and TLB PIDs and ...

Further, beware of the bug that can occur when you pass a pointer
between threads in such an environment. You can nearly always do
this, *except* when the pointer is to the TLS. If you pass a TLS
pointer, in the receiving process it points to a different TLS.

Further... Typically you will have both the TLS mapped at a constant address,
and the TLS for all of the threads visible at some other address. Which
means that there will be at least one page that has aliased virtual addresses.
The TLS alias is probably most often accessed, but if you ever accessthe other,
and you are on a system that uses a virtual indexed cache, you may incur
expensive cache flushes when alising is detected.
Andy Glew
 

Re: Thread Stacks

Postby Andy Glew » Sun Dec 04, 2005 1:15 am

"Peter \"Firefly\" Lund" <firefly@diku.dk> writes:

On Thu, 1 Dec 2005, Ian Rogers wrote:

Linux was using fs selector to address thread local storage (TLS)
for a while, but now it does it through the paging system, which has
many advantages.

Linux uses the gs register for TLS, at least in NPTL (Native POSIX
Thread Library). It does NOT use the paging system. TLS in Linux is
documented in this document:

That's what I thought.

Nevertheless, I replied to Ian Rogers, pointing out some of the
problems of using paging for TLS.

Back in the old days various UNIXes used the same trick for the "u",
the per-process area. Same problems.
Andy Glew
 

Re: Thread Stacks

Postby Andy Glew » Sun Dec 04, 2005 1:15 am

mikpe@harpo.csd.uu.se (Mikael Pettersson) writes:

So, what dos the CALL sequence look like for a segmented/linked stack:

This seems a lot more expensive than existing CALL/RET instructions.

Should there be instructions that accomplish this?
Or should it all be software convention?

You do call/ret as usual.

In a function's prologue you check if you have enough space for
your own frame plus the minimal guarantee[1] you give other functions
you call (if any).

So, what is that minimal guarantee? For functions it's
straightforward enough... but you better make sure that you are not
executing any instructions that get trapped and emulated via user code
(not kernel code, that has its own stack) in that prologue. Better
make sure that the compiler isn;t scheduling code into that
prologue. I.e. make sure that the compiler treats that prologue as an
atomic unit.

What about signals? Can signals run on the user stack? Probably best
to make sure that they run on a different stack.

What about "events" in general? Must all event classes have dedicated
stacks?
Andy Glew
 

Re: Thread Stacks

Postby Eric P. » Sun Dec 04, 2005 1:15 am

Joe Seigh wrote:
Mikael Pettersson wrote:

You do call/ret as usual.

In a function's prologue you check if you have enough space for
your own frame plus the minimal guarantee[1] you give other functions
you call (if any). If you ran out of space, allocate a new segment,
push a trap continuation (current return address and previous segment's
stack pointer), then reinvoke yourself with a trap return address.
At return the trap takes care of switching back to the previous segment.

Since the caller and callee may run in different stack segments,
the compiler should use an argument pointer register for addressing
actual parameters in the caller's frame, or the switch to a new
segment should copy the stacked parameters to the new segment.

z/OS (s360) linkage uses register 1 for the paramater list so I didn't
have to worry about that when I used segmented stacks for it.

You can get rid of most of the need for stack segment allocation and
deallocation from the free pool. Just have a per processor stack
segment free pool and append that to the current stack segment on
thread dispatch. On thread suspend, detach the unused stack segments
from the current stack segment and use those for the next dispatched
thread. No need to use dummy trap returns to free stack segments.
You'd still have to check if you ran out of free stack segments. The
dispatcher can check if free pool gets too big and stack segments need
to be deallocated.

Off the top of my head, some other issues for consideration:

- Large allocation, larger than your pre allocated segment size.
You still need to decide whether to use a standard or custom
segment and release it to the appropriate heap/pool.
Just more things for the prolog and epilog to consider.

- Dynamic stack allocation, alloca, including multiple allocations
in a single routine, and their recovery upon return.
Not a difficult issue, but can't be forgotten.

- Non C or Fortran languages like Ada which allow variable
size function return values.

IOW functions that do not restore the stack pointer on return
so as to effect an allocation in the caller's frame.

- User mode interrupt or exception delivery, including imprecise
exceptions, delivered at the most inconvenient instant during
a stack expansion.

This would effect the decisions on how to chain the segments
(ie probably not a double linked list).

- And of course space recovery during exceptions' stack unwind
(including a partially linked segment that was being chained
just as the exception was delivered.)

- Setjmp & longjmp, and space recovery.

Not that these are insoluble.
Eric
Eric P.
 

Re: Thread Stacks

Postby Eric P. » Sun Dec 04, 2005 1:15 am

Andy Glew wrote:
"Eric P." <eric_pattison@sympaticoREMOVE.ca> writes:

I have had situations where I know that my threads do not require
1 MB. But I cannot lower the stack size for just those threads,
and if I change the global default it may affect all threads,
even ones my code did not create.

Based on this experience I conclude that it would have been better
design for CreateThread to require the caller to specify the both
commit & reserve as arguments for each thread.

You talk as if you come from an embedded world, where the programmer
knows what each thread will do and need.

I think, in the general purpose world, this is NOT true.

Actually, I did start in embedded hardware and software 30 years ago.
But in this case I was intentionally being a slight hyperbolist
in order to make a point about reliability.

E.g. Joe Programmer writes a piece of code. He knows that most of the
time is spent in his code, and that his code is parallelizable, so he
wants to use threads.

However, in several places Joe is calling a library routine. Maybe
printf; maybe some other. Maybe an STL function He doesn't think that
the library routine matters much. It isn't really worth the trouble to
collapse all the threads via a barrier, and serialize, at every
library routine.

But Joe doesn't know how much stack the library routine uses; maybe
typically, but certainly not in all cases.

I think that it is unreasonable for Joe to have to specify the amount
of memory for the thread stacks (or any other property of the library
routine).

Making that requirement basically means that Joe should not use
library routines in his threads. If he does, his code may break if the
library routine changes its stack usage.

All true. Before multi-threading the OS just allocated a huge
swath of on the order of 1 GB of virtual space to the program stack.
This was so huge that it allows programmers to ignore all stack space
considerations. You are more likely to get a "Page File Exhausted"
error than a stack overflow.

Then along comes 32 bit multi-threading and we can't have 1 GB
per thread any more. Many of the previous development practices,
specifically not giving *any* consideration to worst case stack use,
come back to haunt one when there is only 1 MB.

So the answer would seem to be: use a 64 bit cpu, commit 1 GB
of virtual space (and page file space) to every thread stack,
and stick the page file on a great whacking 1 TB disk.

Then we can all get back to writing 'error free' code.

I was inspired by a talk Jack Dennis gave at PACT in Paris in 1997(?):

Jack said that the problem with parallel programming is that it
violates all of the rules of software engineering. E.g. modularity:
you're NOT SUPPOSED to know how much stack a library routine is using.

I suggest that, so long as parallel programming requires the top level
programmer to know all sorts of intimate details about the code he is
calling, parallel programming will remain a niche. Maybe an important
niche, but a niche nevertheless.

So: how do you make parallel programming compatible with software
engineering principles like modularity?

Then the above approach would seem to fill the bill. :-)

I never agreed with that philosophy, that modularity implied absolute
ignorance, so I do not get disappointed. Furthermore, sometimes people
then use such ignorance to imply lack of personal responsibility
for the result.

I use modularity is to lower the number of variables under
consideration below my grossly stupid error point.
Reuse is a bonus.

Eric
Eric P.
 

Re: Thread Stacks

Postby Mikael Pettersson » Sun Dec 04, 2005 4:18 pm

In article <peypmzjim0jj.fsf@pxpl2829.amr.corp.intel.com>,
Andy Glew <andy.glew@intel.com> wrote:
mikpe@harpo.csd.uu.se (Mikael Pettersson) writes:

So, what dos the CALL sequence look like for a segmented/linked stack:

This seems a lot more expensive than existing CALL/RET instructions.

Should there be instructions that accomplish this?
Or should it all be software convention?

You do call/ret as usual.

In a function's prologue you check if you have enough space for
your own frame plus the minimal guarantee[1] you give other functions
you call (if any).

So, what is that minimal guarantee? For functions it's
straightforward enough... but you better make sure that you are not
executing any instructions that get trapped and emulated via user code
(not kernel code, that has its own stack) in that prologue. Better
make sure that the compiler isn;t scheduling code into that
prologue. I.e. make sure that the compiler treats that prologue as an
atomic unit.

The minimal guarantee must include enough space so your callees can
invoke the stack overflow handler. Typically the guarantee also
includes enough to allow most leaf procedures to run without any
stack overflow checks.

What about signals? Can signals run on the user stack? Probably best
to make sure that they run on a different stack.

Signals are a pain because the Unix model with signals running on
your current stack by default is horribly broken. On x86/amd64
our Erlang runtime system uses sigaltstack() to force all signal
handlers to a separate stack, but that requires libc-specific code
since we also have to catch and override sigaction() calls in libc
and other libraries. It's doable but very icky.

The only alternative is to either waste a GPR for a simulated stack
(which kills performance on x86/amd64 since you then also lose
return stack branch prediction), or to always have another page
or so of "slack" at the bottom of the stack for signal handlers.
--
Mikael Pettersson (mikpe@csd.uu.se)
Computing Science Department, Uppsala University
Mikael Pettersson
 

Re: Thread Stacks

Postby Joe Seigh » Sun Dec 04, 2005 5:15 pm

Andy Glew wrote:
What about signals? Can signals run on the user stack? Probably best
to make sure that they run on a different stack.

To say that signals (in unix) are problematic is a wild understatement.
Unix signals are an anachronism that predate threads and are all but
impossible to have interact correctly with threads assuming that you
care about niggling little details like program correctness. If you
look at them as a really primative form of user level threading, the
fact that all the major unix vendors have abandoned user level threads
in favor of kernel level threads should tell you something.

So whatever stack solution you consider, its effect of signal handling
efficiency should not be be a major consideration. In fact if you can
degrade signal handling somewhat then it would be a good thing if it
discourage use of signals.

Someone will point out that there is a need for low latency dispatching
of high priority "tasks" but signals are not the only way to accomplish
that and are a really bad way to do it.


What about "events" in general? Must all event classes have dedicated
stacks?


What do you mean by "events"?

--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.
Joe Seigh
 

Previous

Return to Computer Architecture

Who is online

Users browsing this forum: No registered users and 1 guest