| Author |
Message |
Oliver S.
Guest
|
Posted:
Sun Oct 23, 2005 7:20 am Post subject:
per-processor synchronization idea |
|
|
I'm just finishing my work on my memory allocator and I had some further
interesting ideas about synchronziation. My allocator is very aggressive
in thread-local pooling and there might be circumstances where the sum of
all per-thread-pools for small blocks are quite large. So the idea of a per
-processor pool crossed my mind.
But this includes a problem: The kernel would have to support addtional infor-
mation in the thread-local-storage region to indicate processor-local storage.
The allocator would have to get this information from the TLS-block (on Win32
/x86 this is done to access through the fs-segment) to lock the processor-spe-
cific memory-pool with a usual mutex-guards. The thread might get migrated to
another CPU, but we would keep our "processor-local" heap locked such a short
time, that migration while holding this lock might be very seldom. So in most
cases we would run throgh this process on the same processor and we would get
the benefits of processor-local memory-access, i.e. there usually won't be
any cacheline-migration.
We might stick with that idea, but there's another unfortunate point with
the locking of the processor-specific heap. Compared to what we will do while
we will hold that lock, locking will be an expensive operation a least on x86
-systems because a CAS with a lock-prefix relatively slow (on my Athlon about
three times slower and I wouldn't wonder if it would be even slower on Intel's
P4-CPUs). So if we could prevent any migration of our thread and all other
threads operating on that lock could do the same, we could use the CAS without
a lock-prefix! This would be quite simple to support for the OS' scheduler by
a simple per-thread flag in the TLS-area. Once this flag is set by a thread,
this would prohibit the scheduler to migtrate that thread to another CPU. If
the thread would get the processor-specific-storage pointer from the tls-aray,
it could go ahead and use unlocked CASing without caring for CPU-migration.
That's not the only idea I had on processor-specific-storage and thread-lock-
ing, but what do you think of this? |
|
| Back to top |
|
 |
Chris Thomasson
Guest
|
Posted:
Sun Oct 23, 2005 7:20 am Post subject:
Re: per-processor synchronization idea |
|
|
| Quote: | I'm just finishing my work on my memory allocator and I had some further
interesting ideas about synchronziation. My allocator is very aggressive
in thread-local pooling and there might be circumstances where the sum of
all per-thread-pools for small blocks are quite large. So the idea of a
per
-processor pool crossed my mind.
|
Hoard uses per-processor heaps. I think linux used something where each
processor had its own mutex, and readers would just lock the processor's
mutex to gain access to a shared data-structure. Writers would lock each
processors mutex in order. I forgot what they called the scheme. |
|
| Back to top |
|
 |
Chris Thomasson
Guest
|
Posted:
Sun Oct 23, 2005 7:20 am Post subject:
Re: per-processor synchronization idea |
|
|
| Quote: | So if we could prevent any migration of our thread and all other
threads operating on that lock could do the same, we could use the CAS
without
a lock-prefix!
[...]
That's not the only idea I had on processor-specific-storage and
thread-lock-
ing, but what do you think of this?
|
Not sure if this would be "safe". I need to think about it some more... |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Sun Oct 23, 2005 8:15 am Post subject:
Re: per-processor synchronization idea |
|
|
| Quote: | Hoard uses per-processor heaps.
|
I also read this; but I wondered how HOARD does implement this because
there is no way to implement per-processor-storage at least on Win32. |
|
| Back to top |
|
 |
Joe Seigh
Guest
|
Posted:
Sun Oct 23, 2005 4:15 pm Post subject:
Re: per-processor synchronization idea |
|
|
Oliver S. wrote:
| Quote: | I'm just finishing my work on my memory allocator and I had some further
interesting ideas about synchronziation. My allocator is very aggressive
in thread-local pooling and there might be circumstances where the sum of
all per-thread-pools for small blocks are quite large. So the idea of a per
-processor pool crossed my mind.
[...] |
| Quote: | That's not the only idea I had on processor-specific-storage and
thread-lock-
ing, but what do you think of this?
|
Besides the temporary processor binding solution there are other approaches.
One is something I suggested on c.a. previously, an interrupt/signal on
re-dispatch after preemption. So basically you can have a thread examining
processor local storage and if a processor switch occurs, the thread
automatically gets restarted to the begining of the "critical section".
You can't use locks with this scheme but lock-free works. Something like
that was seriously considered for Linux as part of an RCU for preemptive
user threads proposal. A certain kernel developer was too busy though.
Something you can do now is take advantage of the memory synchronization
that takes place on a context switch (two of which would have to occur
to get a thread from one processor to another. So you could use CAS
w/o membars to push onto a processor local queue. Even if you were on
a different processor when you executed the CAS, proper object visibility
would exist on the processor you orginally loaded the queue address from.
To pop from the queue, after you loaded the address of the object you
wish to dequeue, recheck what processor you are on and restart if different.
You need an ABA safe LIFO queue here. Should work with LL/SC also. This
is if dependent load memory ordering doesn't get you what you want in
the first place.
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software. |
|
| Back to top |
|
 |
Chris Thomasson
Guest
|
Posted:
Mon Oct 24, 2005 12:15 am Post subject:
Re: per-processor synchronization idea |
|
|
| Quote: | One is something I suggested on c.a. previously, an interrupt/signal on
re-dispatch after preemption. So basically you can have a thread
examining
processor local storage and if a processor switch occurs, the thread
automatically gets restarted to the begining of the "critical section".
|
IIRC, that was your user-space RCU proposal?
| Quote: | A certain kernel developer was too busy though.
|
:)
| Quote: | To pop from the queue, after you loaded the address of the object you
wish to dequeue, recheck what processor you are on and restart if
different.
You need an ABA safe LIFO queue here. Should work with LL/SC also. This
is if dependent load memory ordering doesn't get you what you want in
the first place.
|
Yeah. For those who don't know, a read_barrier_depends() is required right
after you load the address of the object you want to pop off a lock-free
lifo. |
|
| Back to top |
|
 |
Chris Thomasson
Guest
|
Posted:
Mon Oct 24, 2005 4:32 am Post subject:
Re: per-processor synchronization idea |
|
|
| Quote: | So if we could prevent any migration of our thread and all other
threads operating on that lock could do the same, we could use the CAS
without
a lock-prefix!
[...]
That's not the only idea I had on processor-specific-storage and
thread-lock-
ing, but what do you think of this?
Not sure if this would be "safe". I need to think about it some more...
|
Okay. I think I see what you are getting at. Multiple threads that are
"pinned to the same CPU" can skip the lock prefix on atomic operations that
affect "shared data residing in the same CPU's local-storage"... |
|
| Back to top |
|
 |
Chris Thomasson
Guest
|
|
| Back to top |
|
 |
Chris Thomasson
Guest
|
Posted:
Mon Oct 24, 2005 8:15 am Post subject:
Re: per-processor synchronization idea |
|
|
Oops. Never mind. They are describing a Peterson-like mutex algorithm, which
doesn't use the lock prefix:
http://www.research.ibm.com/trl/projects/jit/paper/slide_TO-lock.pdf
Their code seems to depend on a load after a store to another location; I
didn't notice any mention of a memory barrier. They don't seem to mention
how they address load-after-store reordering. |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Mon Oct 24, 2005 8:15 am Post subject:
Re: per-processor synchronization idea |
|
|
| Quote: | Okay. I think I see what you are getting at. Multiple threads that are
"pinned to the same CPU" can skip the lock prefix on atomic operations
that affect "shared data residing in the same CPU's local-storage"...
|
Yes, and we're enabled to work with processor-specific storage. |
|
| Back to top |
|
 |
Chris Thomasson
Guest
|
Posted:
Mon Nov 07, 2005 2:21 am Post subject:
Re: per-processor synchronization idea |
|
|
"Oliver S." <Follow.Me@gmx.net> wrote in message
news:djf0qh$4ov$2@domitilla.aioe.org...
| Quote: | Hoard uses per-processor heaps.
I also read this; but I wondered how HOARD does implement this because
there is no way to implement per-processor-storage at least on Win32.
|
I guess you could use the following simple schema to "emulate" a crude
per-processor storage in windows:
#define CPU_DEPTH 4 // whatever
struct per_cpu_mutex_t
{
LONG state;
HANDLE wset;
};
struct per_cpu_t
{
per_cpu_mutex_t mutex;
DWORD_PTR affinity;
};
struct per_thread_t
{
HANDLE me;
per_cpu_t *cpu;
};
static void per_thread_bind_cpu( per_thread_t* );
static void per_thread_cpu_lock( per_thread_t* );
static void per_thread_cpu_unlock( per_thread_t* );
static per_cpu_t g_cpus[CPU_DEPTH];
void per_thread_bind_cpu( per_thread_t *_this )
{
_this->cpu = &g_cpus[HASH( _this )];
SetThreadAffinityMask( _this->me, _this->cpu->affinity );
}
void per_thread_cpu_lock( per_thread_t *_this )
{
per_cpu_mutex_t *cpu_mtx = &_this->cpu->mutex;
// cmpxchg wo/ lock prefix based mutex lock algo
}
void per_thread_cpu_unlock( per_thread_t *_this )
{
per_cpu_mutex_t *cpu_mtx = &_this->cpu->mutex;
// cmpxchg wo/ lock prefix based mutex unlock algo
}
What do you think Oliver? I believe Hoard uses something similar... |
|
| Back to top |
|
 |
|
|
|
|