Thread Stacks
Peter "Firefly" Lund
Posted: Thu Dec 01, 2005 11:43 pm

On Thu, 1 Dec 2005, Ian Rogers wrote:

Quote:
Linux was using fs selector to address thread local storage (TLS) for a
while, but now it does it through the paging system, which has many
advantages.

On 32-bit x86, Linux uses the gs register for TLS, at least in NPTL
(Native POSIX Thread Library). It does NOT use the paging system. The
Linux TLS implementation is documented here:

http://people.redhat.com/drepper/tls.pdf

I also made a quick little test:

$ cat test1.c
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>

__thread int x;

int f()
{
    return x;
}

int main()
{
    printf("x: %d\n", f());
    return EXIT_SUCCESS;
}
$ gcc -fomit-frame-pointer test1.c
$ objdump -M intel -d a.out | grep '<f>:' -A2
08048398 <f>:
8048398: 65 a1 fc ff ff ff mov eax,gs:0xfffffffc
804839e: c3 ret
$

(Ubuntu Breezy Badger)


Windows does something similar, except that it uses fs. I think gs was
chosen on Linux in order to make wine/libwine work with NPTL TLS with as
little fuss as possible.
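
The per-thread nature of a `__thread` variable can also be checked at run time. The following is only a minimal sketch, assuming GCC's `__thread` extension and POSIX threads; the names `counter`, `worker`, and `tls_copies_are_independent` are made up for illustration and are not from the test above.

```c
#include <pthread.h>
#include <stddef.h>

/* One copy of this variable exists per thread (GCC __thread extension). */
static __thread int counter;

/* Runs in the second thread: writes and reads that thread's own copy. */
static void *worker(void *arg)
{
    int *seen = arg;
    counter = 42;
    *seen = counter;
    return NULL;
}

/* Returns 1 if the main thread's copy is untouched by the worker's write. */
int tls_copies_are_independent(void)
{
    pthread_t t;
    int seen_by_worker = 0;

    counter = 7;                       /* main thread's copy */
    if (pthread_create(&t, NULL, worker, &seen_by_worker) != 0)
        return 0;
    pthread_join(t, NULL);
    return counter == 7 && seen_by_worker == 42;
}
```

Calling `tls_copies_are_independent()` from `main` (built with `gcc -pthread`) should return 1, because each thread addresses its own copy of the variable through its own segment base.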

Quote:
Quite a few systems (well I'm thinking JVMs) need places to hold
onto thread local variables, committing them to a register wastes the register
so putting them at constant offsets addressed via a selector is something
that has been talked about being done (you could just hijack an existing TLS
except if you need TLS for green threads).

You could always allocate a selector value of your own and load it into
fs/gs/es when necessary. Linux lets you allocate a few selectors with
per-thread private base/limit values. I think three are available in all
2.5+ versions (but some have a few more).

In NPTL, there is a 1:1 mapping between userspace threads and kernel-level
"execution contexts" (call them threads or processes). As part of each
context switch, a small part of the global descriptor table gets copied
from the process control block and the descriptor table gets reloaded.

-Peter

Brian Hurt
Posted: Fri Dec 02, 2005 1:15 am

George Neuner <gneuner2/@comcast.net> writes:

Quote:
On 32-bit Intel or AMD you could (ab)use MMU segments ... the logical
address space is 64TB IIRC. Of course the physical address space is
only 4GB so you would also have to maintain alternate page table sets
and swap them as necessary.

You still have a limit of 4G of address space per process. Worse yet,
every segment has to be contiguous.

Brian

Jeremy Linton
Posted: Fri Dec 02, 2005 1:15 am

Eric P. wrote:
Quote:
The reserve stack space is recorded in a b-tree of reserved address
ranges in entries called Virtual Address Descriptors, and does
not require any changes to the page table to reserve space.

Yah, because it's just being reserved... Check out the kernel section objects.

Quote:
Committing stack requires changing the PTEs to be demand-zero pages,
but does not actually allocate physical memory. When the pages are
next touched a zeroed page frame is assigned.

Yup.

Quote:
Touching a reserved but not committed page causes an access violation.
According to the documentation and experiment, changing reserved into
committed pages **requires** the _chkstk probes. The OS, Win32 and
the C-RTL work together to make it function correctly.

Yah, you should in theory be able to trap the guard page exceptions and
write your own version... I just wasted an hour looking for the code
that traps the stack guard and issues the VirtualAlloc() or
whatever it's doing. Either I'm missing something or I didn't look long
enough. Not only that, but the 30-second program I just wrote to trap the
guard page exceptions isn't working. Not sure if it's me, or something
I'm missing. M$'s example for working with guard pages doesn't have code
to trap the exception.


Quote:
No.
#pragma check_stack(off) and /Gs only work in DOS & Win3.1
In Win32 these switches are [mostly] ignored and you always get
stack checks when the local vars > 4 KB.

See Q100775 Stack Checking for Windows NT-based Applications.

Yah, my mistake, that's what I get for just remembering that was there
and looking it up.

Quote:

It basically says that _chkstk is *required* for NT to work correctly
and if you just subtract from ESP and touch memory, your code breaks.

For proof, try the following. It will get Access Violation:

void foo (void)
{   // Move down 5 pages in stack and write
    _asm { sub esp, 0x5000 };
    _asm { push eax }
}

Actually, this works fine on my machine; bumping it to 50 pages does
not, so I get your point. On the other hand, if you do a CreateThread()
and commit all the stack memory you shouldn't have this problem, and it
will just work. So it should really only be a problem for the main thread.
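
To make the probing idea concrete: a stack probe touches one byte in each page of a new allocation, highest page first, so that every access lands on the page directly below one that is already valid. The following is only an illustrative sketch in portable C over an ordinary buffer, not the MSVC `_chkstk` routine; `probe_downward` and its parameters are hypothetical names.

```c
#include <stddef.h>

/*
 * Touch one byte in every page of the range [top - bytes, top),
 * starting with the highest page and walking downward.
 * Returns the number of probes performed.
 */
size_t probe_downward(volatile char *top, size_t bytes, size_t page_size)
{
    size_t touched = 0;

    while (bytes >= page_size) {
        top -= page_size;
        bytes -= page_size;
        (void)*top;          /* volatile read: forces a real memory access */
        touched++;
    }
    if (bytes > 0) {         /* partial page at the low end of the range */
        top -= bytes;
        (void)*top;
        touched++;
    }
    return touched;
}
```

With a 4 KB page size, the `sub esp, 0x5000` in the quoted example corresponds to five probes. Skipping them means the first access can land several pages below the guard page, which is the failure Eric describes.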



Quote:
Yes, that is what I said. Well, you were more polite.
I called it Dumb Dumb Dumb.

It does seem a little more complicated and slower than it needs to be. I
just compared the code between vs6 32-bit and vs2005's x64 chkstk
routine. At least in the new version someone (probably
the exception handler I can't find) is squirreling away the current stack
bottom address in gs:[0x10], and chkstk reads it before doing the page
touching. This is better than what vs6 does, which is to touch all the
pages in the allocated space every time an allocation occurs.


Quote:
The whole thing gives me an ugly feeling that it works by luck
rather than design.

Well, that is probably true of a lot of code out there; they simply
haven't stressed it enough to overflow the stack. This is probably true
on nearly every operating system out there written in C. Which brings us
back full circle to the question of using non-linear stacks, discussed
elsewhere in this thread.

BTW: I considered writing some code to try to switch the stack to a
normal VirtualAlloc'ed region, but I don't really see the point since
CreateThread allows you to commit all the memory during the initial
call.

Jeremy Linton
Posted: Fri Dec 02, 2005 1:15 am

Ian Rogers wrote:

Quote:
George Neuner wrote:

Those are great ideas ... you should have had them 20 years ago when
it mattered. Nobody much used the segment unit so Intel deprecated it
and in later IA32 models removed some of the cache hardware that
supported it. It's very expensive now to change a segment register in
32-bit protected mode. The IA64 MMU has completely done away with
the segment unit.

PowerPC has segmentation on the top 4 bits of the address (IIRC). You
can disable this for either instruction or data accesses. I think this
makes it conceivable to have separate 4GB instruction and data regions,
so 8GB total, which is potentially useful in emulation. It's odd that
PowerPC can address more in a 32bit user process than IA32 which seems
to commit more resources to the job. I still think 32bit linear
addresses are a shame :-)

First, PPC had a funny naming convention: they called the virtual
address (used by user programs) the Effective address, and the linear
address the Virtual address. I'm going to ignore their names.
As far as accessing more than 4G in 32-bit mode goes, I don't think so. The
32-bit PPC had 16 segment registers which held the translation to the
linear address; the segment register was selected by the top 4 bits of
the virtual address. It had nothing to do with code vs data: the top
4 bits were simply stripped from the virtual address and used to select
the segment register. The segment register returned a 24-bit Virtual
Segment ID (VSID) which replaced the upper 4 bits for purposes of page
table lookups, which were done with the 24-bit VSID and the 16 bits of
the virtual address below the 4-bit segment selector.
This is still a 32-bit virtual address space, same as the x86. The
64-bit version of the PPC behaves almost the same way except that the
top 36 bits of a 64-bit address are translated through the SLB instead
of the segment registers. The SLB returns a 52-bit VSID which gets
concatenated to the same 16 bits as before.
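
The 32-bit translation described above is just bit slicing, which can be sketched in C. This is an illustration only: it assumes 4 KB pages (so the low 12 bits are the byte offset, a detail not spelled out in the post), uses the post's "virtual/linear" naming rather than the PowerPC manuals', and the names `addr_fields`, `split_address`, and `to_linear` are hypothetical.

```c
#include <stdint.h>

/* Fields of a 32-bit program address on 32-bit PPC. */
typedef struct {
    unsigned sr;       /* top 4 bits: which of the 16 segment registers */
    uint32_t page;     /* next 16 bits: page index within the segment   */
    uint32_t offset;   /* low 12 bits: byte offset (assumes 4 KB pages) */
} addr_fields;

addr_fields split_address(uint32_t addr)
{
    addr_fields f;
    f.sr     = addr >> 28;
    f.page   = (addr >> 12) & 0xFFFFu;
    f.offset = addr & 0xFFFu;
    return f;
}

/*
 * Replace the 4 selector bits with the 24-bit VSID held in the chosen
 * segment register, giving a 24 + 16 + 12 = 52-bit linear address.
 */
uint64_t to_linear(const uint32_t segment_regs[16], uint32_t addr)
{
    addr_fields f = split_address(addr);
    uint64_t vsid = segment_regs[f.sr] & 0xFFFFFFu;

    return (vsid << 28) | ((uint64_t)f.page << 12) | f.offset;
}
```

For example, with segment register 3 holding VSID 0x123456, the address 0x3ABCD123 maps to 0x123456ABCD123. The program still only ever sees 32 bits of address, which is the point made above.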

Two google hits for reference.
http://www.cs.wisc.edu/~remzi/Classes/537/Fall2005/Lectures/lecture14.ppt
http://www.cs.cmu.edu/~412/projects/9mac/PowerPC_Memory.ppt

Jeremy Linton
Posted: Fri Dec 02, 2005 9:15 am

Jeremy Linton wrote:
Quote:
Ian Rogers wrote:
PowerPC has segmentation on the top 4 bits of the address (IIRC). You
can disable this for either instruction or data accesses. I think this
makes it conceivable to have separate 4GB instruction and data
regions, so 8GB total, which is potentially useful in emulation. It's
odd that PowerPC can address more in a 32bit user process than IA32
which seems to commit more resources to the job. I still think 32bit
linear addresses are a shame :-)

This is still a 32-bit virtual address space, same as the x86. The
64-bit version of the PPC behaves almost the same way except that the
top 36 bits of a 64-bit address are translated through the SLB instead

Oh, I guess there are a couple of really ugly hacks to get more... You
could just run with either instruction or data address translation
turned off (which is what you said, but I read it differently the first
time around). In other words, you run with MSR:ir and MSR:dr in different
states. This could give you more than 4G in a funky kind of way. If you
ran with data translation off it would be really hard to protect your
memory from another process. On the other hand, you might be able to get
away with running with instruction address translation turned off, but
you would have the limitation of not being able to have more than 4G of
total instruction space across all processes. It still wouldn't look
like more than 4G, just that instruction fetches would go directly to
physical memory. It would be an absolute nightmare to maintain, plus you
would have to dedicate the first 4G of physical RAM, and you wouldn't be
able to demand page those addresses. It might be possible but it hurts
to think about.
It's possible you could get a similar effect by just changing the
IBATs for each process, and be able to access something other than the
first 4G. Still, it would be really strange to deal with. You could be
executing code at one virtual address and fetching data from the same
virtual address and getting different data returned.

Eric P.
Posted: Sat Dec 03, 2005 12:14 am

Jeremy Linton wrote:
Quote:

Eric P. wrote:

Touching a reserved but not committed page causes an access violation.
According to the documentation and experiment, changing reserved into
committed pages **requires** the _chkstk probes. The OS, Win32 and
the C-RTL work together to make it function correctly.
Yah, you should in theory be able to trap the guard page exceptions and
write your own version.... I just wasted an hour looking for the code
that is trapping the stack guard and issuing the VirtualAlloc() or
whatever its doing. Either i'm missing something or I didn't look long
enough. Not only that but the 30 second program I just wrote to trap the
guard page exceptions isn't working. Not sure if its me, or something
i'm missing. M$'s example for working with guard pages doesn't have code
to trap the exception.

I tried the same thing, and also could not find it. That was why I
had that 'kernel(?)' reference in my original description.
Like I said earlier, despite the documentation hinting that it
is handled in Win32, I suspect it is picked off inside the kernel.

Quote:
I called it Dumb Dumb Dumb.
It does seem a little more complicated and slow than it needs to be. I
just compared the code between vs6 32-bit and vs2005's x64 chkstk
routine. At least in the new version someone is squirreling (probably
the exception handler I can't find) the current stack bottom address in
gs:[0x10] and reading it before doing the page touching. This is better
than what vs6 is doing which is always touching all the pages in
allocated space every time an allocation occurred.

Yeah. It is good that it at least checks the low water mark now,
as I suggested earlier. But it is still very inefficient because
it touches every page one by one. This causes a real demand zero
page fault, and real frame allocation, for every page when it
should be just committing a whole range instead.

Just imagine if you had a really large array, say hundreds of MB,
allocated on the stack so your code would be reentrant.
The current approach kills large application performance.

It would at least be vastly more efficient if chkstk just did a
bounds check against the stack lower bound and touched the lower
guard page if out of bounds. Takes 3 instructions.
To perform automatic commit stack expansion, when a Reserved
page is touched it causes an access violation. A First Chance
Exception Handler would catch it and use VirtualAlloc to expand
the whole commit range into reserve space up to the touched page.
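
The check proposed above amounts to one comparison plus, on the slow path, a computation of how many pages to commit in a single request. A rough sketch of that arithmetic follows; it is not how Windows implements it, and `pages_to_commit`, `commit_low`, and the other names are hypothetical. It assumes a downward-growing stack and a power-of-two page size.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Given the current stack pointer, the size of the frame about to be
 * allocated, and the lowest committed stack address, return how many
 * pages must be committed in one request (0 on the fast path).
 */
size_t pages_to_commit(uintptr_t sp, size_t frame_bytes,
                       uintptr_t commit_low, size_t page_size)
{
    uintptr_t new_sp = sp - frame_bytes;

    if (new_sp >= commit_low)
        return 0;                       /* fast path: already committed */

    /* Round the new stack pointer down to the start of its page. */
    uintptr_t new_low = new_sp & ~(uintptr_t)(page_size - 1);
    return (commit_low - new_low) / page_size;
}
```

The fast path is a subtract, a compare, and a branch, which matches the "3 instructions" estimate; only an out-of-bounds frame pays for the range commit.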

Ideally, for maximum efficiency the above is done automatically by
the OS's Page Fault Handler code, skipping the unnecessary exception
and subsequent system service call. For that to work, the stack
memory would need to be distinguished from other memory by having its
own memory section type, a Stack Section, with its own set of behavior.
I believe this is easily justified because thread stacks are so
central to application execution and critical to performance.

Eric

Eric P.
Posted: Sat Dec 03, 2005 12:38 am

"Eric P." wrote:
Quote:

Dave Hansen wrote:

What, then, is the purpose of the dwStackSize parameter to the Win32
function CreateThread?

It sets the Commit size. I have no idea why MS thinks I would want
to since the commit pages are automatically expanded.

BTW, I thought of one reason why having Commit size in the
CreateThread arg list is important, beyond the efficiency issue.

It is also a reliability issue. It ensures that resources,
specifically page file backing store, are allocated at thread
create time. If CreateThread succeeds, you are sure that you
have enough stack space to run.

A reserved allocation will not be checked until converted to commit
and then might be denied. That could cause unexpected failures.

So for reliable execution, in an ideal world both Reserve and
Commit space would be specified in the CreateThread arg list
(as they apparently are, internally).
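
For readers more familiar with POSIX, the reserve/commit split can be loosely imitated with `mmap` and `mprotect`. This is only an analogy, not the Win32 mechanism: it assumes a system that provides `MAP_ANONYMOUS` (Linux, the BSDs), the helper names are hypothetical, and on systems that overcommit memory the "commit" step does not give the backing-store guarantee discussed above.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* "Reserve": claim address space with no access rights. */
void *reserve_region(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_NONE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* "Commit": make a page-aligned sub-range usable. Returns 0 on success. */
int commit_range(void *addr, size_t bytes)
{
    return mprotect(addr, bytes, PROT_READ | PROT_WRITE);
}

/* Give the whole region back to the system. */
void release_region(void *base, size_t bytes)
{
    munmap(base, bytes);
}
```

A stack-like use would reserve, say, 1 MB and commit only the top 64 KB up front. A failure from `commit_range` at creation time is the kind of early error the post argues for, instead of a fault much later at an arbitrary call depth.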

Eric

robertwessel2@yahoo.com
Posted: Sat Dec 03, 2005 1:15 am

Eric P. wrote:
Quote:
I have had situations where I know that my threads do not require
1 MB. But I cannot lower the stack size for just those threads,
and if I change the global default it may affect all threads,
even ones my code did not create.

Based on this experience I conclude that it would have been better
design for CreateThread to require the caller to specify both
commit & reserve as arguments for each thread.

The global process defaults look easy, but seem just plain dangerous
to me because they allow stacks to be adjusted without regard to usage.


If you can live with a WinXP dependency, you can specify
STACK_SIZE_PARAM_IS_A_RESERVATION on CreateThread(), which causes the
stack size parameter to be treated as, ahem, a reservation request and
not a commit request.

Interestingly, CreateFiberEx() allows you to specify both the reserve
and commit size for the stack.
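
As a point of comparison only (this is the POSIX threads API, not Win32), a per-thread stack size can be requested through a thread attribute, leaving the process-wide default alone. A minimal sketch, with hypothetical helper names `small_worker` and `run_with_stack_size`:

```c
#include <pthread.h>
#include <stddef.h>

/* Trivial thread body: records that it ran. */
static void *small_worker(void *arg)
{
    int *done = arg;
    *done = 1;
    return NULL;
}

/*
 * Start one thread with its own stack size and wait for it.
 * Returns 0 on success, otherwise the pthread error code.
 */
int run_with_stack_size(size_t stack_bytes, int *done)
{
    pthread_attr_t attr;
    pthread_t tid;
    int rc = pthread_attr_init(&attr);

    if (rc != 0)
        return rc;
    rc = pthread_attr_setstacksize(&attr, stack_bytes);
    if (rc == 0)
        rc = pthread_create(&tid, &attr, small_worker, done);
    pthread_attr_destroy(&attr);
    if (rc != 0)
        return rc;
    return pthread_join(tid, NULL);
}
```

For instance, `run_with_stack_size(256 * 1024, &done)` asks for a 256 KB stack for that one thread; sizes below the platform minimum are rejected with an error code.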

Andy Glew
Posted: Sat Dec 03, 2005 5:00 pm

Quote:
Erlang doesn't have intra-stack data pointers, so stack relocation works.
For languages like C/C++ you should probably just use segmented/linked stacks.

So, what does the CALL sequence look like for a segmented/linked stack?

High-level CALL
    If there is room for another stack frame in the current segment
    Then
        Frame <- StackPointer
        StackPointer <- StackPointer - FrameSize
        OldFrame.NextFrameLink <- Frame
    Else
        Allocate NextStackSegment (e.g. take one off a linked freelist)
        CurrentStackSegment.next <- NextStackSegment
        Frame <- NextStackSegment.start
        StackPointer <- Frame - FrameSize
        OldFrame.NextFrameLink <- Frame

Similarly for a high-level return, assuming that pointers to stack
variables are handled elsewhere.

This seems a lot more expensive than existing CALL/RET instructions.

Should there be instructions that accomplish this?
Or should it all be software convention?

Mikael Pettersson
Posted: Sat Dec 03, 2005 5:15 pm

In article <peyphd9qo36h.fsf@pxpl2829.amr.corp.intel.com>,
Andy Glew <andy.glew@intel.com> wrote:
Quote:
Erlang doesn't have intra-stack data pointers, so stack relocation works.
For languages like C/C++ you should probably just use segmented/linked stacks.

So, what does the CALL sequence look like for a segmented/linked stack?

High-level CALL
    If there is room for another stack frame in the current segment
    Then
        Frame <- StackPointer
        StackPointer <- StackPointer - FrameSize
        OldFrame.NextFrameLink <- Frame
    Else
        Allocate NextStackSegment (e.g. take one off a linked freelist)
        CurrentStackSegment.next <- NextStackSegment
        Frame <- NextStackSegment.start
        StackPointer <- Frame - FrameSize
        OldFrame.NextFrameLink <- Frame

Similarly for a high-level return, assuming that pointers to stack
variables are handled elsewhere.

This seems a lot more expensive than existing CALL/RET instructions.

Should there be instructions that accomplish this?
Or should it all be software convention?

You do call/ret as usual.

In a function's prologue you check if you have enough space for
your own frame plus the minimal guarantee[1] you give other functions
you call (if any). If you ran out of space, allocate a new segment,
push a trap continuation (current return address and previous segment's
stack pointer), then reinvoke yourself with a trap return address.
At return the trap takes care of switching back to the previous segment.

Since the caller and callee may run in different stack segments,
the compiler should use an argument pointer register for addressing
actual parameters in the caller's frame, or the switch to a new
segment should copy the stacked parameters to the new segment.

[1] Callees must have a few words available for storing return
address + argument registers before invoking the stack overflow
handler.
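
The "check for room, otherwise switch to a new segment" step can be modelled in plain C as a frame allocator over linked segments. This is a simplified sketch, not a real calling convention: it uses `malloc` where a runtime would use a free list, grows upward within a segment for simplicity, assumes frame sizes are multiples of the required alignment, and leaves out the trap continuation and parameter copying. All names are hypothetical.

```c
#include <stddef.h>
#include <stdlib.h>

typedef struct segment {
    struct segment *prev;     /* segment we overflowed from */
    size_t used;              /* bytes handed out so far */
    size_t capacity;          /* bytes available in data[] */
    unsigned char data[];     /* frame storage */
} segment;

typedef struct {
    segment *top;             /* segment currently in use */
    size_t segment_bytes;     /* default size of a new segment */
    size_t segment_count;     /* number of live segments */
} seg_stack;

/* Prologue check: allocate a frame, linking in a new segment if needed. */
void *push_frame(seg_stack *st, size_t frame_bytes)
{
    if (st->top == NULL || st->top->capacity - st->top->used < frame_bytes) {
        size_t cap = frame_bytes > st->segment_bytes ? frame_bytes
                                                     : st->segment_bytes;
        segment *s = malloc(sizeof *s + cap);
        if (s == NULL)
            return NULL;
        s->prev = st->top;
        s->used = 0;
        s->capacity = cap;
        st->top = s;
        st->segment_count++;
    }
    void *frame = st->top->data + st->top->used;
    st->top->used += frame_bytes;
    return frame;
}

/* Epilogue: release the most recent frame; drop the segment once empty. */
void pop_frame(seg_stack *st, size_t frame_bytes)
{
    segment *s = st->top;

    s->used -= frame_bytes;
    if (s->used == 0) {       /* plays the role of the trap return */
        st->top = s->prev;
        free(s);
        st->segment_count--;
    }
}
```

The common case in `push_frame` is a single comparison, which is why the check can live in the function prologue while ordinary call/ret instructions are used unchanged.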
--
Mikael Pettersson (mikpe@csd.uu.se)
Computing Science Department, Uppsala University

Joe Seigh
Posted: Sat Dec 03, 2005 5:15 pm

Mikael Pettersson wrote:
Quote:
In article <peyphd9qo36h.fsf@pxpl2829.amr.corp.intel.com>,
Andy Glew <andy.glew@intel.com> wrote:
[...]

This seems a lot more expensive than existing CALL/RET instructions.

Should there be instructions that accomplish this?
Or should it all be software convention?


You do call/ret as usual.

In a function's prologue you check if you have enough space for
your own frame plus the minimal guarantee[1] you give other functions
you call (if any). If you ran out of space, allocate a new segment,
push a trap continuation (current return address and previous segment's
stack pointer), then reinvoke yourself with a trap return address.
At return the trap takes care of switching back to the previous segment.

Since the caller and callee may run in different stack segments,
the compiler should use an argument pointer register for addressing
actual parameters in the caller's frame, or the switch to a new
segment should copy the stacked parameters to the new segment.

z/OS (s360) linkage uses register 1 for the parameter list so I didn't
have to worry about that when I used segmented stacks for it.

You can get rid of most of the need for stack segment allocation and
deallocation from the free pool. Just have a per processor stack
segment free pool and append that to the current stack segment on
thread dispatch. On thread suspend, detach the unused stack segments
from the current stack segment and use those for the next dispatched
thread. No need to use dummy trap returns to free stack segments.
You'd still have to check if you ran out of free stack segments. The
dispatcher can check whether the free pool has grown too big and stack
segments need to be deallocated.
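
One way to picture the lend-and-reclaim scheme is as list splicing. The sketch below is only an illustration under simplifying assumptions (a single pool, no locking, segments reduced to their link fields); `cpu_pool`, `thread_stack`, `on_dispatch`, `advance`, and `on_suspend` are hypothetical names, not z/OS interfaces.

```c
#include <stddef.h>

typedef struct seg {
    struct seg *prev;          /* segment we came from */
    struct seg *next;          /* next segment to overflow into */
} seg;

typedef struct { seg *free_head; } cpu_pool;      /* per-processor pool */
typedef struct { seg *current; } thread_stack;    /* thread's top segment */

/* Dispatch: append the whole free pool after the thread's top segment. */
void on_dispatch(cpu_pool *pool, thread_stack *t)
{
    t->current->next = pool->free_head;
    pool->free_head = NULL;
}

/* Overflow: step into the next lent segment. Returns 0 if none is left. */
int advance(thread_stack *t)
{
    seg *n = t->current->next;

    if (n == NULL)
        return 0;              /* caller must allocate a fresh segment */
    n->prev = t->current;
    t->current = n;
    return 1;
}

/* Suspend: detach the unused segments and return how many went back. */
size_t on_suspend(cpu_pool *pool, thread_stack *t)
{
    size_t n = 0;

    pool->free_head = t->current->next;
    t->current->next = NULL;
    for (seg *s = pool->free_head; s != NULL; s = s->next)
        n++;
    return n;
}
```

Returning out of a segment just moves `current` back to `prev` and leaves the segment linked, so it is still available for the next overflow or is handed back on suspend. That is why no dummy trap return is needed to free it.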


--
Joe Seigh

When you get lemons, you make lemonade.
When you get hardware, you make software.

George Neuner
Posted: Sat Dec 03, 2005 5:15 pm

On Thu, 01 Dec 2005 21:01:43 -0000, Brian Hurt <bhurt@AUTO> wrote:

Quote:
George Neuner <gneuner2/@comcast.net> writes:

On 32-bit Intel or AMD you could (ab)use MMU segments ... the logical
address space is 64TB IIRC. Of course the physical address space is
only 4GB so you would also have to maintain alternate page table sets
and swap them as necessary.

You still have a limit of 4G of address space per process. Worse yet,
every segment has to be contiguous.


Only logically contiguous. The segments are still paged and you can
virtualize the page tables if you really want to. I'm certainly not
going to claim it's easy to do though.
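
For reference, the 64TB figure follows from the selector format: a 13-bit descriptor index plus one table-indicator bit (GDT or LDT) gives at most 2 x 8192 segments, each with a 32-bit offset. A small sketch of that arithmetic, with a made-up function name (the usable count is slightly lower because GDT entry 0 is the null descriptor):

```c
#include <stdint.h>

/* Upper bound on the IA-32 logical (segment:offset) address space. */
uint64_t ia32_logical_space_bytes(void)
{
    const uint64_t descriptors = 2u * 8192u;              /* GDT + LDT */
    const uint64_t bytes_per_segment = UINT64_C(1) << 32; /* 32-bit offset */

    return descriptors * bytes_per_segment;               /* 2^46 = 64 TB */
}
```

Every one of those segments is still mapped into the same 4GB linear address space, which is why the page tables would have to be swapped to make use of it.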

George
--
for email reply remove "/" from address

George Neuner
Posted: Sat Dec 03, 2005 5:15 pm

On Thu, 01 Dec 2005 10:50:39 +0000, Ian Rogers
<ian.rogers@manchester.ac.uk> wrote:

Quote:
George Neuner wrote:
Those are great ideas ... you should have had them 20 years ago when
it mattered. Nobody much used the segment unit so Intel deprecated it
and in later IA32 models removed some of the cache hardware that
supported it. It's very expensive now to change a segment register in
32-bit protected mode. The IA64 MMU has completely done away with
the segment unit.

Linux was using fs selector to address thread local storage (TLS) for a
while, but now it does it through the paging system, which has many
advantages. Quite a few systems (well I'm thinking JVMs) need places to
hold onto thread local variables, committing them to a register wastes
the register so putting them at constant offsets addressed via a
selector is something that has been talked about being done (you could
just hijack an existing TLS except if you need TLS for green threads).

Using multiple segments is fine so long as you are not constantly
changing them - you take a hit on each selector register load.

Even reloading the same value into a segment register takes a small
hit - the value is already known to be a valid selector so it doesn't
need to be validated again, but the limits and privileges have to be
reloaded from the cache and applied to the MMU. If you change the
register value and the selector isn't in the cache you take a huge hit
to do the validation check and application.

George
--
for email reply remove "/" from address

Andy Glew
Posted: Sun Dec 04, 2005 12:21 am

Quote:
"Eric P." <eric_pattison@sympaticoREMOVE.ca> writes:

I have had situations where I know that my threads do not require
1 MB. But I cannot lower the stack size for just those threads,
and if I change the global default it may affect all threads,
even ones my code did not create.

Based on this experience I conclude that it would have been better
design for CreateThread to require the caller to specify both
commit & reserve as arguments for each thread.

You talk as if you come from an embedded world, where the programmer
knows what each thread will do and need.

I think, in the general purpose world, this is NOT true.

E.g. Joe Programmer writes a piece of code. He knows that most of the
time is spent in his code, and that his code is parallelizable, so he
wants to use threads.

However, in several places Joe is calling a library routine. Maybe
printf; maybe some other. Maybe an STL function. He doesn't think that
the library routine matters much. It isn't really worth the trouble to
collapse all the threads via a barrier, and serialize, at every
library routine.

But Joe doesn't know how much stack the library routine uses; maybe
typically, but certainly not in all cases.

I think that it is unreasonable for Joe to have to specify the amount
of memory for the thread stacks (or any other property of the library
routine).

Making that requirement basically means that Joe should not use
library routines in his threads. If he does, his code may break if the
library routine changes its stack usage.

---

I was inspired by a talk Jack Dennis gave at PACT in Paris in 1997(?):

Jack said that the problem with parallel programming is that it
violates all of the rules of software engineering. E.g. modularity:
you're NOT SUPPOSED to know how much stack a library routine is using.

I suggest that, so long as parallel programming requires the top level
programmer to know all sorts of intimate details about the code he is
calling, parallel programming will remain a niche. Maybe an important
niche, but a niche nevertheless.

So: how do you make parallel programming compatible with software
engineering principles like modularity?

Rupert Pigott
Posted: Sun Dec 04, 2005 12:55 am

Andy Glew wrote:

[SNIP]

Quote:
I suggest that, so long as parallel programming requires the top level
programmer to know all sorts of intimate details about the code he is
calling, parallel programming will remain a niche. Maybe an important
niche, but a niche nevertheless.

Parallel programming hasn't required that kind of knowledge for donkey's
years. Threads will *always* require that kind of knowledge, and that is
why they suck.

Fine grain parallelism just doesn't scale well, it is a dead end.

Quote:
So: how do you make parallel programming compatible with software
engineering principles like modularity?

Do I need to say CSP again ? :/

CSP enforces modularity, that is one of the main points behind it.

Cheers,
Rupert

All times are GMT