Oliver S.
Guest
Posted: Tue Oct 11, 2005 7:39 am
Post subject: hyperthreading in database-benchmarks

Has anyone found information on how much hyperthreading is able to improve the
performance of database workloads (OLTP as well as DWH)? As far as I know,
database systems have an unusually high rate of cache misses and thus suffer more
from the latency "problem" of current memory subsystems. So I guess databases will
profit more from hyperthreading than most other applications.
And to get some synthetic numbers on how hyperthreading is able to partially
compensate for the latency problem, it would make sense to run a memory-latency
benchmark (with the usual pointer-chasing method) with two threads; has anyone
found some numbers on this?
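As a rough illustration of the pointer-chasing method mentioned in this post: the sketch below builds a random single-cycle permutation (Sattolo's algorithm) so that every load depends on the previous one, then walks one private chain per thread. The names `build_chain`, `chase` and `run_threads` are hypothetical. Python is used here only to show the access pattern; interpreter overhead and the global interpreter lock hide real memory latency, so an actual measurement would need a port to C or C++ with both threads pinned to the two logical CPUs of one core.

```python
import random
import threading
import time


def build_chain(n, seed=0):
    """Return a list `nxt` where following nxt[i] visits all n slots in one cycle.

    Sattolo's algorithm yields a uniformly random cyclic permutation, so the
    next address cannot be guessed from the current one.
    """
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i, never i itself: guarantees a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Follow the chain for `steps` dependent loads and return the final index."""
    i = start
    for _ in range(steps):
        i = nxt[i]
    return i


def run_threads(num_threads, n=1 << 16, steps=1 << 16):
    """Chase one private chain per thread; return elapsed wall-clock seconds."""
    chains = [build_chain(n, seed=t) for t in range(num_threads)]
    workers = [threading.Thread(target=chase, args=(c, steps)) for c in chains]
    begin = time.perf_counter()
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return time.perf_counter() - begin


if __name__ == "__main__":
    for k in (1, 2):
        print(f"{k} thread(s): {run_threads(k):.4f} s")
```

In a compiled port, the interesting number would be the two-thread time divided by the one-thread time for the same per-thread step count: a ratio near 1.0 would mean the second hardware thread hides almost all of the load stalls, a ratio near 2.0 would mean it hides none.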

Chris Thomasson
Guest
Posted: Tue Oct 11, 2005 7:39 am
Post subject: Re: hyperthreading in database-benchmarks

[quote]Has anyone found information on how much hyperthreading is able to improve the
performance of database workloads (OLTP as well as DWH)? As far as I know,
database systems have an unusually high rate of cache misses and thus suffer more
from the latency "problem" of current memory subsystems. So I guess databases will
profit more from hyperthreading than most other applications.
[/quote]
It depends on application design. Two threads on the same core are tightly
coupled together. You would need to try to ensure that reader threads that
are bound to a single SMT core access similar data, and preferably execute
down similar code paths. Cache locality is very important: false sharing,
cache blocking, etc.
SMT can reduce performance if an application is not properly designed.
Usually, lock-free designs that avoid atomic operations and StoreLoad-style
memory barriers are able to make "some" performance gains on an SMT
system...

Del Cecchi
Guest
Posted: Tue Oct 11, 2005 8:15 am
Post subject: Re: hyperthreading in database-benchmarks

"Oliver S." <Follow.Me@gmx.net> wrote in message
news:434b25a3$0$64081$892e7fe2@authen.white.readfreenews.net...
[quote]Has anyone found information on how much hyperthreading is able to improve the
performance of database workloads (OLTP as well as DWH)? As far as I know,
database systems have an unusually high rate of cache misses and thus suffer more
from the latency "problem" of current memory subsystems. So I guess databases will
profit more from hyperthreading than most other applications.
And to get some synthetic numbers on how hyperthreading is able to partially
compensate for the latency problem, it would make sense to run a memory-latency
benchmark (with the usual pointer-chasing method) with two threads; has anyone
found some numbers on this?
[/quote]
You could look in the POWER5 issue I referenced in another thread. I
recall an article about hyperthreading.

Oliver S.
Guest
Posted: Tue Oct 11, 2005 1:52 pm
Post subject: Re: hyperthreading in database-benchmarks

[quote]It depends on application design.
Two threads on the same core are tightly coupled together.
[/quote]
I don't want to talk about SMT in general, but about SMT in scenarios where
a lot of load stalls occur in each thread.
[quote]You would need to try to ensure that reader threads that are bound
to a single SMT core access similar data, and preferably execute
down similar code paths. Cache locality is very important:
false sharing, cache blocking, etc.
Usually, lock-free designs that avoid atomic operations and StoreLoad-style
memory barriers are able to make "some" performance gains on an
SMT system...
[/quote]
Aaaaah, it's unbelievable - you are such an IDIOT!
You're again misusing a thread with very common blah-blah to demonstrate
your lock-free and threading competencies and not to really contribute to
the discussion. Please shut up until you have understood the constraints of
database engines.

Chris Thomasson
Guest
Posted: Tue Oct 11, 2005 2:14 pm
Post subject: Re: hyperthreading in database-benchmarks

"Oliver S." <Follow.Me@gmx.net> wrote in message
news:434b7d0d$0$66453$892e7fe2@authen.white.readfreenews.net...
[quote]It depends on application design.
Two threads on the same core are tightly coupled together.
I don't want to talk about SMT in general, but about SMT in scenarios where
a lot of load stalls occur in each thread.
You would need to try to ensure that reader threads that are bound
to a single SMT core access similar data, and preferably execute
down similar code paths. Cache locality is very important: false sharing,
cache blocking, etc.
Usually, lock-free designs that avoid atomic operations and StoreLoad-style
memory barriers are able to make "some" performance gains on an
SMT system...
Aaaaah, it's unbelievable - you are such an IDIOT!
You're again misusing a thread with very common blah-blah to demonstrate
your lock-free and threading competencies and not to really contribute to
the discussion. Please shut up until you have understood the constraints of
database engines.
[/quote]
http://groups.google.com/group/comp.arch/msg/2f40a2aaeb8cacb6?hl=en
:O

Joe Seigh
Guest
Posted: Tue Oct 11, 2005 4:15 pm
Post subject: Re: hyperthreading in database-benchmarks

Oliver S. wrote:
[quote]Has anyone found information on how much hyperthreading is able to improve the
performance of database workloads (OLTP as well as DWH)? As far as I know,
database systems have an unusually high rate of cache misses and thus suffer more
from the latency "problem" of current memory subsystems. So I guess databases will
profit more from hyperthreading than most other applications.
And to get some synthetic numbers on how hyperthreading is able to partially
compensate for the latency problem, it would make sense to run a memory-latency
benchmark (with the usual pointer-chasing method) with two threads; has anyone
found some numbers on this?
[/quote]
I assume you're talking about running queries which would sequentially
scan memory, which isn't cache's strong point since it's LRU optimized.
And it doesn't look like the mfgrs are doubling up on cache for these systems,
so no help here.
You could get help running a hardware scout thread or ganging the queries so
they act as hardware scouts for each other, getting some synergism out of the
process. There probably won't be gang scheduling support from any of the
OSes for a while at least due to the instability of hw design at this point.
Where are the cache hits occurring? On the index traversals or on the table
data itself or on both?
--
Joe Seigh
When you get lemons, you make lemonade.
When you get hardware, you make software.

Oliver S.
Guest
Posted: Tue Oct 11, 2005 9:33 pm
Post subject: Re: hyperthreading in database-benchmarks

[quote]I assume you're talking about running queries which would sequentially
scan memory which isn't cache's strong point since it's LRU optimized.
[/quote]
Of course.
[quote]And it doesn't look like the mfgrs are doubling up on cache for these
systems, so no help here.
[/quote]
All query workloads that fit within any cache of a current CPU architecture
aren't worth mentioning because they're small anyway. I'm thinking of
a huge load of OLTP clients running a lot of query threads or a small load
of DWH clients running queries on large data volumes.
[quote]You could get help running a hardware scout thread or ganging the queries
so they act as hardware scouts for each other getting some synergism out
of the process.
[/quote]
I think scouting is a nice idea, but if you have some threads on each core
which will fill each other's stalls, the effect of scouting will become very
small.
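The stall-filling argument can be put into a back-of-envelope model. The sketch below assumes (an idealisation, not something measured in this thread) that every thread alternates between `compute` cycles of useful work and `stall` cycles waiting on memory, that all threads share one issue pipeline, and that switching between them is free. Under those assumptions core utilisation is min(1, threads * compute / (compute + stall)). The function names and the cycle counts are made up for illustration.

```python
def core_utilization(threads, compute, stall):
    """Fraction of cycles the core does useful work in the idealised model."""
    if threads < 1 or compute <= 0 or stall < 0:
        raise ValueError("need threads >= 1, compute > 0, stall >= 0")
    return min(1.0, threads * compute / (compute + stall))


def smt_speedup(threads, compute, stall):
    """Throughput of `threads` hardware threads relative to a single thread."""
    return core_utilization(threads, compute, stall) / core_utilization(1, compute, stall)


if __name__ == "__main__":
    # Hypothetical numbers: 50 cycles of work between misses, 200-cycle miss.
    for n in (1, 2, 4, 8):
        print(n, round(core_utilization(n, 50, 200), 2), round(smt_speedup(n, 50, 200), 2))
```

In this model, once enough threads are present to push utilisation close to 1, there are few idle cycles left for a scout thread to exploit. Real cores will fall short of the model because the threads also compete for caches and other shared resources.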
[quote]There probably won't be gang scheduling support from any of the OSes
for a while at least due to the instability of hw design at this point.
[/quote]
I don't think scheduling is an issue here because with large data sets and
appropriate read-ahead, the query threads wouldn't give up their time slices
by doing I/O very often.
[quote]Where are the cache hits occurring?
On the index traversals or on the table data itself or on both?
[/quote]
On both of course, but there are two different scenarios: With DWH workloads,
there are a lot of linear accesses to the indices and the blocks of the tables,
so linear memory performance becomes prevalent here, and prefetching, which is
simpler than scouting, will help. With OLTP workloads, the parts fetched
from the data blocks become more scattered and prefetching won't help a lot.
BTW: I think I'll do a little benchmark with two threads doing pointer-chasing
in the next few days. So we'll see what hyperthreading can do in this worst case
often cited by SMT advocates.
BTW2: I'm currently writing a high-performance general-purpose memory allocator;
I'm pretty sure it will outperform most current allocators on small block
sizes.

Guest
Posted: Tue Oct 11, 2005 9:34 pm
Post subject: Re: hyperthreading in database-benchmarks

"Chris Thomasson" <_no_spam_cristom@no_spam_comcast._net> writes:
[quote]"Oliver S." <Follow.Me@gmx.net> wrote in message
news:434b7d0d$0$66453$892e7fe2@authen.white.readfreenews.net...
It depends on application design.
Two threads on the same core are tightly coupled together.
I don't want to talk about SMT in general, but about SMT in scenarios where
a lot of load stalls occur in each thread.
You would need to try to ensure that reader threads that are bound
to a single SMT core access similar data, and preferably execute
down similar code paths. Cache locality is very important: false sharing,
cache blocking, etc.
Usually, lock-free designs that avoid atomic operations and StoreLoad-style
memory barriers are able to make "some" performance gains on an
SMT system...
Aaaaah, it's unbelievable - you are such an IDIOT!
You're again misusing a thread with very common blah-blah to demonstrate
your lock-free and threading competencies and not to really contribute to
the discussion. Please shut up until you have understood the constraints of
database engines.
http://groups.google.com/group/comp.arch/msg/2f40a2aaeb8cacb6?hl=en
[/quote]
The first entry in my new kill file!
--
David Gay
dgay@acm.org

Bill Todd
Guest
Posted: Thu Oct 13, 2005 12:15 am
Post subject: Re: hyperthreading in database-benchmarks

Oliver S. wrote:
[quote]Has anyone found information on how much hyperthreading is able to improve the
performance of database workloads (OLTP as well as DWH)?
[/quote]
My recollection is that POWER5's SMT is said to give it something like a
35% boost in TPC-C, and Montecito's coarser-grained 'hyperthreading' is
said to provide less (more like 25%). Those of course are both
dual-thread SMT implementations without any more execution units than
their non-SMT predecessors: EV8's quad-thread implementation did (IIRC)
contain more execution units, was fine-grained, and was said to provide
over 2x (possibly as much as 3x - it's been a long time since I visited
the material) the TPC-C throughput that a non-SMT version would have
managed.
I don't recall where any of those numbers came from, though - so if you
want sources, start Googling.
- bill

David Kanter
Guest
Posted: Thu Oct 13, 2005 8:15 am
Post subject: Re: hyperthreading in database-benchmarks

Bill Todd wrote:
[quote]Oliver S. wrote:
Has anyone found information on how much hyperthreading is able to improve the
performance of database workloads (OLTP as well as DWH)?
My recollection is that POWER5's SMT is said to give it something like a
35% boost in TPC-C, and Montecito's coarser-grained 'hyperthreading' is
said to provide less (more like 25%). Those of course are both
dual-thread SMT implementations without any more execution units than
their non-SMT predecessors: EV8's quad-thread implementation did (IIRC)
contain more execution units, was fine-grained, and was said to provide
over 2x (possibly as much as 3x - it's been a long time since I visited
the material) the TPC-C throughput that a non-SMT version would have
managed.
[/quote]
The EV8 had quite a few more functional units than anything shipping
today, see Paul DeMone's article:
http://www.realworldtech.com/page.cfm?ArticleID=RWT021802145442&p=2
8 ALUs
4 FPUs
2 LD units
2 ST units
It was estimated by Joel Emer at about a 225-230% boost (hard to tell
with the graph and scale):
www.cs.washington.edu/research/smt/papers/compaqMF.ppt
This persuades me that a chip designed for SMT from the ground up can
get quite a bit better than just 40%. The real question is whether you
are better off with CMP than a wide SMT...hard to say and we'll
probably never know.
David

Bill Todd
Guest
Posted: Thu Oct 13, 2005 8:15 am
Post subject: Re: hyperthreading in database-benchmarks

David Kanter wrote:
[quote]Bill Todd wrote:
Oliver S. wrote:
Has anyone found information on how much hyperthreading is able to improve the
performance of database workloads (OLTP as well as DWH)?
My recollection is that POWER5's SMT is said to give it something like a
35% boost in TPC-C, and Montecito's coarser-grained 'hyperthreading' is
said to provide less (more like 25%). Those of course are both
dual-thread SMT implementations without any more execution units than
their non-SMT predecessors: EV8's quad-thread implementation did (IIRC)
contain more execution units, was fine-grained, and was said to provide
over 2x (possibly as much as 3x - it's been a long time since I visited
the material) the TPC-C throughput that a non-SMT version would have
managed.
[/quote]
....
[quote]It was estimated by Joel Emer at about a 225-230% boost
[/quote]
As a mathematician, you really ought to be more careful with your
terminology (and this isn't the first time I've noticed that, which is
why I'm commenting upon it): the 'boost' you're describing is 125% - 130%.
[quote](hard to tell
with the graph and scale):
www.cs.washington.edu/research/smt/papers/compaqMF.ppt
This persuades me that a chip designed for SMT from the ground up can
get quite a bit better than just 40%.
[/quote]
Well, even with the added execution units when running only two threads
the EV8 managed less than 70% in the 'TP' workload described (and did
even worse in some of the other workloads when limited to two threads):
the ability to support four concurrent threads (and to keep them
reasonably well-supplied with resources) was its most significant advantage.
[quote]The real question is whether you
are better off with CMP than a wide SMT...hard to say
[/quote]
Not really. Even as cores continue to diminish in size, *some* level of
SMT will remain desirable insofar as it allows one to put to good use
more execution units whether to enhance the performance of a single
thread or to enhance the performance of multiple concurrent threads
within the single core (i.e., it provides a core which can handle a
wider range of workloads more closely to optimally, rather than a static
arrangement either starved for execution units when servicing a number
of demanding threads lower than the number of cores or leaving execution
units idle even when a number of far-less-demanding threads covers all
the cores).
So the *real* question is whether that's *enough* of an improvement to
justify the added design effort (and relatively small additional
physical overheads - at least as evidenced by current examples) involved
- and if the answer is 'yes', then just what level of multi-threading
within the multiple separate cores on a chip is ideal across the normal
distribution of real-world workloads (one can't just suggest that SMT
could eliminate *all* need for CMP since wire and synchronization delays
within a single core are non-negligible factors which bound total core
size even if the complexity of, say, supporting many dozens of
concurrent threads did not).
EV8 may have occupied a unique moment in time when placing multiple
relatively high-performance cores on a single chip was not yet quite
feasible but when the level of performance desired from a single thread
was not yet so limited by the 'memory wall' that single-thread
performance had ceased to be so desirable - at least if you could find
good uses for the many execution units at other times as well to make
the chip more generally useful. Even so, it should be some time yet
before such considerations fade away completely (and if ever something
pushes back that memory wall sufficiently they'll resurface).
- bill

David Kanter
Guest
Posted: Fri Oct 14, 2005 12:12 am
Post subject: Re: hyperthreading in database-benchmarks

Bill Todd wrote:
[quote]David Kanter wrote:
Bill Todd wrote:
Oliver S. wrote:
Has anyone found information on how much hyperthreading is able to improve the
performance of database workloads (OLTP as well as DWH)?
My recollection is that POWER5's SMT is said to give it something like a
35% boost in TPC-C, and Montecito's coarser-grained 'hyperthreading' is
said to provide less (more like 25%). Those of course are both
dual-thread SMT implementations without any more execution units than
their non-SMT predecessors: EV8's quad-thread implementation did (IIRC)
contain more execution units, was fine-grained, and was said to provide
over 2x (possibly as much as 3x - it's been a long time since I visited
the material) the TPC-C throughput that a non-SMT version would have
managed.
...
It was estimated by Joel Emer at about a 225-230% boost
As a mathematician, you really ought to be more careful with your
terminology (and this isn't the first time I've noticed that, which is
why I'm commenting upon it): the 'boost' you're describing is 125% - 130%.
[/quote]
You're right, sorry about that, it was rather late. Yes, it is a
125-130% boost over the non-SMT case.
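The terminology point is plain arithmetic: a result that is 225-230% of the baseline throughput is a 125-130% improvement over it, i.e. a speedup factor of 2.25x to 2.30x. A small sketch of the conversion (the helper names are hypothetical):

```python
def boost_percent(percent_of_baseline):
    """Convert 'X% of baseline throughput' into 'percent gain over baseline'."""
    return percent_of_baseline - 100.0


def speedup_factor(percent_of_baseline):
    """Express the same result as a multiplicative speedup."""
    return percent_of_baseline / 100.0


if __name__ == "__main__":
    for of_baseline in (225.0, 230.0):
        print(f"{of_baseline:.0f}% of baseline: "
              f"{boost_percent(of_baseline):.0f}% boost, "
              f"{speedup_factor(of_baseline):.2f}x")
```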
[quote](hard to tell
with the graph and scale):
www.cs.washington.edu/research/smt/papers/compaqMF.ppt
This persuades me that a chip designed for SMT from the ground up can
get quite a bit better than just 40%.
Well, even with the added execution units when running only two threads
the EV8 managed less than 70% in the 'TP' workload described (and did
even worse in some of the other workloads when limited to two threads):
the ability to support four concurrent threads (and to keep them
reasonably well-supplied with resources) was its most significant advantage.
The real question is whether you
are better off with CMP than a wide SMT...hard to say
Not really. Even as cores continue to diminish in size, *some* level of
SMT will remain desirable insofar as it allows one to put to good use
more execution units whether to enhance the performance of a single
thread or to enhance the performance of multiple concurrent threads
within the single core (i.e., it provides a core which can handle a
wider range of workloads more closely to optimally, rather than a static
arrangement either starved for execution units when servicing a number
of demanding threads lower than the number of cores or leaving execution
units idle even when a number of far-less-demanding threads covers all
the cores).
So the *real* question is whether that's *enough* of an improvement to
justify the added design effort
[/quote]
That was part of my question/point. What I really want to know is
this:
'Doubling' the performance using CMP usually doubles the die size (and
a little more). However, it keeps the core at the same size.
What does it take to double the performance using SMT? What does it do
to overall die size and core size?
Core size is important because a bigger core means longer pipelines to drive
data across the chip.
We know the cost of adding SMT is very small and affordable, even for a 4T design.
However, how much would it cost in terms of additional functional units,
branch prediction mechanisms, etc. to take today's existing
two-threaded designs and provide a 70% boost?
I guess what I am trying to get at is "What are the costs to double
performance using SMT, compared to CMP?"
The biggest cost for SMT is probably validation.
[quote](and relatively small additional
physical overheads - at least as evidenced by current examples) involved
- and if the answer is 'yes', then just what level of multi-threading
within the multiple separate cores on a chip is ideal across the normal
distribution of real-world workloads (one can't just suggest that SMT
could eliminate *all* need for CMP since wire and synchronization delays
within a single core are non-negligible factors which bound total core
size even if the complexity of, say, supporting many dozens of
concurrent threads did not).
[/quote]
Precisely. We're on the same page.
[quote]EV8 may have occupied a unique moment in time when placing multiple
relatively high-performance cores on a single chip was not yet quite
feasible but when the level of performance desired from a single thread
was not yet so limited by the 'memory wall' that single-thread
performance had ceased to be so desirable - at least if you could find
good uses for the many execution units at other times as well to make
the chip more generally useful. Even so, it should be some time yet
before such considerations fade away completely (and if ever something
pushes back that memory wall sufficiently they'll resurface).
[/quote]
Fundamentally, the issue stems from the memory wall, but the immediate
issue was heat and power.
David

JJ
Guest
Posted: Fri Oct 14, 2005 12:15 am
Post subject: Re: hyperthreading in database-benchmarks

David Kanter wrote:
snipping
[quote]'Doubling' the performance using CMP usually doubles the die size (and
a little more). However, it keeps the core at the same size.
What does it take to double the performance using SMT? What does it do
to overall die size and core size?
Core size is important because a bigger core means longer pipelines to drive
data across the chip.
We know the cost of adding SMT is very small and affordable, even for a 4T design.
However, how much would it cost in terms of additional functional units,
branch prediction mechanisms, etc. to take today's existing
two-threaded designs and provide a 70% boost?
I guess what I am trying to get at is "What are the costs to double
performance using SMT, compared to CMP?"
[/quote]
One way of looking at the problem is to see that any cache, even L1, is
itself a memory wall, even if it's only one to a few cycles, because only one
thread is serialized through one SRAM.
Imagine a relatively simple 4-way MTA that runs much faster than
normal complex designs but pushes the L1 down to be slower and much
bigger, or just uses the L2 directly but massively interleaves it and uses
all the banks concurrently for all the threads that will be in flight.
Getting all the banks to work concurrently is what will make the wall
fall down, when married to multiple PEs designed to exploit the huge
issue rates that banking enables.
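To get a feel for how much concurrency interleaving could deliver, here is a sketch under a simplifying assumption that is not from the post: each of `requests` in-flight accesses picks one of `banks` banks uniformly at random and independently (roughly what address randomizing would aim for), and a bank serves one access per cycle. The expected number of distinct banks kept busy is then banks * (1 - (1 - 1/banks)^requests). The function names and parameters are illustrative only.

```python
import random


def expected_busy_banks(banks, requests):
    """Expected number of distinct banks hit by uniformly random requests."""
    return banks * (1.0 - (1.0 - 1.0 / banks) ** requests)


def simulate_busy_banks(banks, requests, trials=20000, seed=0):
    """Monte Carlo estimate of the same quantity, as a sanity check."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        total += len({rng.randrange(banks) for _ in range(requests)})
    return total / trials


if __name__ == "__main__":
    for banks in (4, 16, 64):
        print(banks,
              round(expected_busy_banks(banks, 16), 2),
              round(simulate_busy_banks(banks, 16), 2))
```

Under this model, the number of accesses served per cycle keeps growing as banks are added for a fixed number of in-flight requests, but bank conflicts mean it always stays below the request count when there is more than one request.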
[quote]The biggest cost for SMT is probably validation.
(and relatively small additional
physical overheads - at least as evidenced by current examples) involved
- and if the answer is 'yes', then just what level of multi-threading
within the multiple separate cores on a chip is ideal across the normal
distribution of real-world workloads (one can't just suggest that SMT
could eliminate *all* need for CMP since wire and synchronization delays
within a single core are non-negligible factors which bound total core
size even if the complexity of, say, supporting many dozens of
concurrent threads did not).
Precisely. We're on the same page.
EV8 may have occupied a unique moment in time when placing multiple
relatively high-performance cores on a single chip was not yet quite
feasible but when the level of performance desired from a single thread
was not yet so limited by the 'memory wall' that single-thread
performance had ceased to be so desirable - at least if you could find
good uses for the many execution units at other times as well to make
the chip more generally useful. Even so, it should be some time yet
before such considerations fade away completely (and if ever something
pushes back that memory wall sufficiently they'll resurface).
[/quote]
To be first always seems to hurt badly:)
[quote]Fundamentally, the issue stems from the memory wall, but the immediate
issue was heat and power.
[/quote]
The memory wall fades away if the big fat SRAM is forced to deliver
a lot more concurrency, which can be exploited by many MTA PEs, and it
does require doing a few things that might seem unpalatable
(randomizing etc). And no, it isn't black magic or 666 or any such
voodoo nonsense. It just requires some out-of-the-box thinking that is
more familiar in DSP terms. More later.
John

Jens Meyer
Guest
Posted: Mon Oct 17, 2005 8:15 am
Post subject: Re: hyperthreading in database-benchmarks

I hope these extreme brainiac architectures will die!
I'd rather see cores like those of Sun's Niagara.
My dream CPU would look like the following:
- four or even six in-order pipes per core
- no speculative execution
- all pipes competing for four execution units:
one ALU, one load/store unit, one FP adder and one FP multiplier
- of course: fully pipelined execution units, i.e. every execution unit
can take one request per clock cycle from any pipe, but a pipe always
stalls until its request has finished
- I- and D-caches with sizes on the order of Niagara's L1 caches
- large shared L2 cache (e.g. 2 MB for a desktop CPU and 8 MB for a
server CPU)
- support for execute-ahead (aka scouting) on unused threads on a core
I think that such a simple architecture would make it much easier to
get high clock rates, as the pipes are two orders of magnitude simpler than those
of full-blown brainiac cores.

Jens Meyer
Guest
Posted: Mon Oct 17, 2005 8:15 am
Post subject: Re: hyperthreading in database-benchmarks

[quote]- four or even six in-order pipes per core
- no speculative execution
- all pipes competing for four execution units:
one ALU, one load/store unit, one FP adder and one FP multiplier
- of course: fully pipelined execution units, i.e. every execution unit
can take one request per clock cycle from any pipe, but a pipe always
stalls until its request has finished
- I- and D-caches with sizes on the order of Niagara's L1 caches
- large shared L2 cache (e.g. 2 MB for a desktop CPU and 8 MB for a
server CPU)
- support for execute-ahead (aka scouting) on unused threads on a core
[/quote]
I forgot the following:
- a cache bus from the L2 to the L1 cache as wide as possible *g*