AMD to leave x86 behind?
Robert Redelmeier
Guest





Posted: Wed Nov 02, 2005 9:15 am    Post subject: Re: AMD to leave x86 behind?

In comp.sys.ibm.pc.hardware.chips YKhan <yjkhan@gmail.com> wrote:
Quote:
Bill Davidsen wrote:
Do you believe that AMD uses every execution unit on every cycle?
Hopefully not, which raises the possibility of SMT, at least in theory.

No, but I don't believe AMD has nearly enough execution units to
devote to a second thread at acceptable speeds. I'd say keeping
the number of pipeline stages the same and the same number of
execution units in both a P4 and an Athlon, the P4 might be
able to run both threads at 70% of single-thread efficiency,
whereas the Athlon might only be able to execute 40% efficiency.

So the Athlon gets bogged to 80% by SMT while the P4 turkey
flies at 140% workload? No.

The P4 is a horrible throwback. It has an impressive array
of Exec Units, but only TWO issue ports like the original
1993 Pentium. The Athlons (and PPro-PentM) have three issue
ports and much better IPC.

Whether SMT helps depends on the workload, and particularly
how much time the CPU spends stalled on uncached memory reads
and pipeline flushes. The latest AMDs have lower latency and
shorter pipelines so will benefit less, 'cuz they're already
faster clock-for-clock.

-- Robert

Yousuf Khan
Guest

Posted: Wed Nov 02, 2005 5:15 pm    Post subject: Re: AMD to leave x86 behind?

Patrick Schaaf wrote:
Quote:
Yousuf Khan <bbbl67@ezrs.com> writes:


Okay, I stand corrected. Now one question arises from this. How do the
Opterons themselves know how to handle the message passing between
themselves? Do they just broadcast out in every direction and hope it
gets there with the smallest route, and ignore any duplicates coming in
later from non-optimal routes? Or is there a lookup table that gets
programmed into each CPU telling it which direction to send each message?


There is an elaborate set of special purpose registers, which is used
at system startup to configure routing tables in each processor,
for memory address ranges and I/O stuff, IIRC. This is presumably
done by the BIOS.

There's a "BIOS and kernel programmer's guide" somewhere on the AMD
site, which describes this in detail. Pub.Nr #26094 from the copy
I find lying around here.

Actually, even before the BIOS starts up, there has to be some kind of
default setup for the Opterons, doesn't there? How does the BIOS
individually program each Opteron? It has to start on only one of them
first, so one of them has to be the primary Opteron. How does this get
decided?

Yousuf Khan

Yousuf Khan
Guest

Posted: Wed Nov 02, 2005 5:15 pm    Post subject: Re: AMD to leave x86 behind?

Robert Redelmeier wrote:
Quote:
No, but I don't believe AMD has nearly enough execution units to
devote to a second thread at acceptable speeds. I'd say keeping
the number of pipeline stages the same and the same number of
execution units in both a P4 and an Athlon, the P4 might be
able to run both threads at 70% of single-thread efficiency,
whereas the Athlon might only be able to execute 40% efficiency.


So the Athlon gets bogged to 80% by SMT while the P4 turkey
flies at 140% workload? No.

No, what I'm saying is that typical programs, which by themselves
hardly ever stress the CPU 100%, will each run at some average
percentage of their typical single-threaded speed. If both threads are
stressing the CPU 100%, then everybody slows down. However, if one is
stressing the processor while the other is not, then both can be running
at their typical speed.
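
A toy way to put numbers on that argument, as a sketch only: assume each thread needs some fraction of the core's execution capacity when it runs alone, and that an SMT core shares capacity proportionally once the combined demand exceeds what the core can supply. The function name, the `capacity` parameter and the proportional-sharing rule are all assumptions made for illustration; real cores contend for specific resources (issue ports, caches, buffers) and do not behave this simply.

```python
def smt_speeds(demand_a, demand_b, capacity=1.0):
    """Return each thread's speed relative to running alone.

    demand_a, demand_b: fraction of the core's execution capacity each
    thread uses when it has the core to itself (hypothetical numbers).
    capacity: total capacity of the core, normalised to 1.0.
    """
    total = demand_a + demand_b
    if total <= capacity:
        # Combined demand fits: neither thread is slowed down.
        return 1.0, 1.0
    # Oversubscribed: assume both threads are scaled back equally.
    scale = capacity / total
    return scale, scale


if __name__ == "__main__":
    print(smt_speeds(0.4, 0.5))  # two light threads
    print(smt_speeds(0.7, 0.3))  # one heavy, one light
    print(smt_speeds(1.0, 1.0))  # both saturate the core
```

In this model two threads that each saturate the core both drop to half speed, while a heavy thread paired with a light one loses nothing as long as their combined demand fits.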

Yousuf Khan

Yousuf Khan
Guest

Posted: Wed Nov 02, 2005 5:15 pm    Post subject: Re: AMD to leave x86 behind?

Del Cecchi wrote:
Quote:
Okay, I stand corrected. Now one question arises from this. How do the
Opterons themselves know how to handle the message passing between
themselves? Do they just broadcast out in every direction and hope it
gets there with the smallest route, and ignore any duplicates coming in
later from non-optimal routes? Or is there a lookup table that gets
programmed into each CPU telling it which direction to send each
message?

Yousuf Khan


I have never heard that the multiple HT ports include a switch or a
routing table. My guess is that a broadcast protocol is used.

I guess the next question would be how do they identify duplicates?

Yousuf Khan

Stephen Fuld
Guest

Posted: Wed Nov 02, 2005 5:15 pm    Post subject: Re: AMD to leave x86 behind?

"David Schwartz" <davids@webmaster.com> wrote in message
news:dk9kg0$m86$1@nntp.webmaster.com...
Quote:

"Stephen Fuld" <s.fuld@PleaseRemove.att.net> wrote in message
news:5NQ9f.10906$qk4.277@bgtnsc05-news.ops.worldnet.att.net...

That sounds reasonable but I believe it assumes eight CPUs. If they
wanted to address the market for larger systems, what are the numbers for
16 or 32 CPUs? They could certainly charge a premium for such systems
and you might be able to have the number of links be a bond out option,
thus keeping their costs low and giving an extra market segment to
compete in.

Once you're in the 16 or 32 CPU market, you can afford a central
non-blocking switch. So three ports is plenty.

That makes sense. Thanks.

--
- Stephen Fuld
e-mail address disguised to prevent spam

Oliver S.
Guest

Posted: Thu Nov 03, 2005 1:15 am    Post subject: Re: AMD to leave x86 behind?

Quote:
These instructions wouldn't work better than the prefetching-instructions
currently implemented. I think it would be cleverer to copy hw-scouting
from Sun's upcoming CPUs. HW-scouting is simple to implement if you're
going to have a SMT-core anyway.

So what's HW-scouting?

Hardware scouting runs ahead in the instruction stream as far as
possible, up to a maximum depth (to prevent thrashing the cache), to do
a more intelligent kind of prefetching. It's said to give an average
performance boost of 10%.

MitchAlsup@aol.com
Guest

Posted: Thu Nov 03, 2005 1:15 am    Post subject: Re: AMD to leave x86 behind?

Yousuf Khan wrote:
Quote:
Now one question arises from this. How do the
Opterons themselves know how to handle the message passing between
themselves? Do they just broadcast out in every direction and hope it
gets there with the smallest route, and ignore any duplicates coming in
later from non-optimal routes? Or is there a lookup table that gets
programmed into each CPU telling it which direction to send each message?

Yousuf Khan

As memory requests enter the NorthBridge, the physical address is
converted into a 6-bit NodeID through a boot-programmed table. Target
NodeID is used to route to the service unit and Requestor NodeID is
used to route back to the requestor.

When a memory request arrives at a memory controller, it is inserted
into a queue. When there are no other earlier requests to a given cache
line in the queue, the request will be forwarded into the DRAM
controller for processing and, simultaneously, the PROBE will be
inserted into the HyperTransport fabric for routing to the coherent
caches/CPUs. Therefore, the memory controller determines memory order.
The PROBE responses and the DRAM data meet back at the originating node
and then a target done is sent to the memory controller to enable the
next access to that line. Therefore the requestor determines when a
transaction completes.
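
The table lookup described above can be pictured with a small sketch. Everything concrete here is made up for illustration: the two-node topology, the address ranges, the link names and the function names are hypothetical, and the real registers are the ones in AMD's BIOS and kernel programmer's guide. It only shows the idea of "physical address -> target NodeID -> outgoing link", with no broadcast and no duplicate handling needed.

```python
# Hypothetical address map, as a boot-time table might define it:
# (base, limit, node_id). Values are invented for this example.
ADDRESS_MAP = [
    (0x000000000, 0x0FFFFFFFF, 0),  # first 4 GB lives on node 0
    (0x100000000, 0x1FFFFFFFF, 1),  # next 4 GB lives on node 1
]

# Hypothetical per-node routing table: ROUTING[this_node][target] -> link.
ROUTING = {
    0: {0: "self", 1: "link0"},
    1: {0: "link0", 1: "self"},
}


def target_node(phys_addr):
    """Map a physical address to the NodeID that owns that memory."""
    for base, limit, node in ADDRESS_MAP:
        if base <= phys_addr <= limit:
            return node & 0x3F  # the post describes a 6-bit NodeID
    raise ValueError(f"address {phys_addr:#x} is not mapped to any node")


def route(this_node, phys_addr):
    """Pick the outgoing link for a request issued on this_node."""
    return ROUTING[this_node][target_node(phys_addr)]


if __name__ == "__main__":
    print(route(0, 0x1234))        # local memory on node 0
    print(route(0, 0x100000000))   # remote memory, forwarded on a link
```

The reply path would use the same kind of table keyed on the requestor's NodeID.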

Bengt Larsson
Guest

Posted: Thu Nov 03, 2005 1:15 am    Post subject: Re: AMD to leave x86 behind?

In comp.arch, "YKhan" <yjkhan@gmail.com> wrote:

Quote:
Bill Davidsen wrote:
Do you believe that AMD uses every execution unit on every cycle?
Hopefully not, which raises the possibility of SMT, at least in theory.

No, but I don't believe AMD has nearly enough execution units to devote
to a second thread at acceptable speeds. I'd say keeping the number of
pipeline stages the same and the same number of execution units in both
a P4 and an Athlon, the P4 might be able to run both threads at 70% of
single-thread efficiency, whereas the Athlon might only be able to
execute 40% efficiency.

For what it's worth, I'm running a 3.0 GHz Prescott here and
throughput on SETI@home is about 8000 seconds for one workunit with HT
off, and 11000 seconds for two workunits with HT on.
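
Working those two timings through, as a quick sketch (the function name is just for illustration, and it assumes the two workunits are comparable in size to the single one):

```python
def ht_summary(t_single, t_pair):
    """Compare one workunit with HT off against two workunits with HT on.

    t_single: seconds for one workunit, HT off.
    t_pair:   seconds for two workunits run together, HT on.
    Returns (throughput gain, per-thread speed relative to HT off).
    """
    throughput_gain = 2 * t_single / t_pair
    per_thread_speed = t_single / t_pair
    return throughput_gain, per_thread_speed


if __name__ == "__main__":
    gain, per_thread = ht_summary(8000, 11000)
    print(f"throughput x{gain:.2f}, each thread at {per_thread:.0%}")
```

That comes to roughly 1.45x the throughput, with each thread running at about 73% of its HT-off speed, which is close to the 70% figure guessed for the P4 in the quoted text.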

Yousuf Khan
Guest

Posted: Thu Nov 03, 2005 1:15 am    Post subject: Re: AMD to leave x86 behind?

MitchAlsup@aol.com wrote:
Quote:
As memory requests enter the NorthBridge, the physical address is
converted into a 6-bit NodeID through a boot-programmed table. Target
NodeID is used to route to the service unit and Requestor NodeID is
used to route back to the requestor.

When a memory request arrives at a memory controller, it is inserted
into a queue. When there are no other earlier requests to a given cache
line in the queue, the request will be forwarded into the DRAM
controller for processing and, simultaneously, the PROBE will be
inserted into the HyperTransport fabric for routing to the coherent
caches/CPUs. Therefore, the memory controller determines memory order.
The PROBE responses and the DRAM data meet back at the originating node
and then a target done is sent to the memory controller to enable the
next access to that line. Therefore the requestor determines when a
transaction completes.


Great, thanks.

Yousuf Khan

Bill Davidsen
Guest

Posted: Thu Nov 03, 2005 5:15 pm    Post subject: Re: AMD to leave x86 behind?

Bill Todd wrote:
Quote:
Bill Davidsen wrote:

Yousuf Khan wrote:

David Kanter wrote:

If you think that any modern MPU is efficient, you are smoking crack.
They all have plenty of unused cycles left on the table (except when
running linpack).




But the secret is to have enough idle cycles to run both threads at
close to full speed each. I'd say anything that had enough to run
both threads at 80% full speed, was a reasonably successful SMT.



I suspect you still don't understand SMT...


A failing you may share yourself.

if you had enough idle cycles to run 2x80% you should either get rid
of all the silicon wasted in idle capacity or add a little more and go
full dual core.


In which case your single-thread performance would suck on applications
that stressed the core to unusual levels (lots of potential ILP, few
memory stalls).

Only if you assume that such cases are common and that seems at odds
with being able to run two threads at 80%. I said in my original post
that if you have enough execution units to handle the worst case for a
single thread, the most that could ever be done at once, then SMT was a
cheap way to pick up some use of the unused units at times when they
would be idle. So I did note that enough units were needed to satisfy
maximum single thread requirements. That was in the first paragraph
after I said what I posted was simplified.
Quote:

IIRC EV6/7 Alphas still spend on average something like 50% of their
time waiting for memory, so a fine-grained SMT implementation could
potentially run two threads at close to 100% of the single-thread speed.

That indicates some real problems with the memory interface, which I
don't believe are present in x86, particularly the AMD chips. The reason
I reach that conclusion is purely experimental: on a CPU-bound task, i.e.
a huge compile, a 10% increase in clock speed (overclock) results in
almost exactly a 10% drop in real time, as does 15%. I conclude from
that data that the disk and memory paths are not the limiting factor at
those speeds, and the ability of the CPU to process instructions is.

If the CPU were maxing out the memory bandwidth for any significant time
the performance gain should be non-linear.
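
The reasoning can be checked with a simple two-part model, offered as a sketch under stated assumptions: run time is split into a part that scales with core clock and a stall part (memory, disk) that does not. The function name and the stall fractions are hypothetical, and the model assumes the overclock leaves memory timing unchanged, which is not true of every overclocking method.

```python
def relative_runtime(clock_gain, stall_fraction):
    """Run time after a clock increase, relative to the baseline.

    clock_gain:     fractional clock increase, e.g. 0.10 for +10%.
    stall_fraction: share of baseline time spent waiting on memory or
                    disk, assumed not to speed up with the core clock.
    """
    cpu_fraction = 1.0 - stall_fraction
    return cpu_fraction / (1.0 + clock_gain) + stall_fraction


if __name__ == "__main__":
    for stall in (0.0, 0.25, 0.5):
        drop = 1.0 - relative_runtime(0.10, stall)
        print(f"stall fraction {stall:.2f}: run time drops {drop:.1%}")
```

Under this model even a purely CPU-bound job gains a little under 10% from a 10% overclock (1 - 1/1.1 is about 9.1%), and the drop shrinks as the stall fraction grows, so a measured drop near that ceiling points to very little time spent waiting on memory or disk.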

Quote:
EV8 simulations indicated that running 4 threads could (depending upon
the characteristics of the workload) increase throughput 2x - 3x over a
non-SMT implementation. And EV8 wasn't even primarily *aimed* at SMT:
it was aimed at industry-leading single-thread performance in
highly-parallelizable code, had enough silicon devoted to the job to
ensure that, and used SMT as a way to leverage that silicon when it
wasn't being used for its primary purpose.

I can't argue with that, and it fits the model of having all the
capability needed by a single thread and using it when idle. That's a
huge increase, from your earlier comment I suspect some serious memory
wait issues.
Quote:

Core silicon is cheap, and getting cheaper still. Core power less so,
but with the developing ability to shut down unused elements at very
fine time granularity that problem is being addressed. What SMT gives a
core is the flexibility to be good at *both* single-thread and
multi-thread performance in ways that single-threaded cores cannot be,
at a relatively small cost in additional chip area (5% - 10%, depending
upon the implementation granularity - IIRC those are about the numbers
for Xeon vs. POWER5 and EV8; Montecito's
'switch-on-event-multi-threading' may be even less).

I don't really think we're disagreeing here, although you might have
missed my initial comment on having enough capacity for a single best
case thread, based on your comment on single thread performance.

--
bill davidsen
SBC/Prodigy Yorktown Heights NY data center
http://newsgroups.news.prodigy.com

Bill Davidsen
Guest

Posted: Thu Nov 03, 2005 5:15 pm    Post subject: Re: AMD to leave x86 behind?

Oliver S. wrote:
Quote:
These instructions wouldn't work better than the
prefetching-instructions
currently implemented. I think it would be cleverer to copy hw-scouting
from Sun's upcoming CPUs. HW-scouting is simple to implement if you're
going to have a SMT-core anyway.


So what's HW-scouting?


Hardware scouting runs ahead in the instruction stream as far as
possible, up to a maximum depth (to prevent thrashing the cache), to do
a more intelligent kind of prefetching. It's said to give an average
performance boost of 10%.

Interesting concept, do they fetch into L1 or L2?

Thinking about speculative fetch vs. speculative execution, how does
speculative execution play with SMT? It would seem that they compete for
the same resources, although clearly more could be provided.

--
bill davidsen
SBC/Prodigy Yorktown Heights NY data center
http://newsgroups.news.prodigy.com

Bill Davidsen
Guest

Posted: Thu Nov 03, 2005 5:15 pm    Post subject: Re: AMD to leave x86 behind?

keith wrote:
Quote:
On Tue, 01 Nov 2005 17:52:40 +0000, Bill Davidsen wrote:


YKhan wrote:


how about some form of SMT for AMD?


I don't know that might come too, but it can't be done as easily as
Hyperthreading. Hyperthreading relied on the Pentium 4's inherent
inefficiency to run a lot of threads simultaneously.


Actually not. It takes advantage of the fact that the CPU has a number
of instruction units, adders, FPU, etc, which one thread MAY use at once
in some cases, but which are not ALWAYS all in use.


Does Intel's HT issue from different threads on the same cycle? I thought
not, which makes the number of execution units moot. Certainly there are
units idle in any given cycle, that's where dynamic power management comes
into play.

I hope someone with that information contributes a factual answer, I've
seen the number two in this group and other places, but I've not read it
in anything official.
Quote:


By adding another instruction decoder one or more additional threads can
run using the resources which are otherwise not needed at some clock
tick.


Not easily. It's pretty tough to be able to fetch and decode from
different threads simultaneously. Another decoder doesn't really help much.


Do you believe that AMD uses every execution unit on every cycle?


I certainly don't, but I also don't believe Intel uses every execution
unit every cycle. I could be wrong, but AIUI they cannot dispatch
instructions from two threads in the same cycle.


Hopefully not, which raises the possibility of SMT, at least in theory.


Ok, but surely you know that theory <> reality, except in theory.

In practice I see 30% gain on the best case things and 3-5% on the
worst. I have seen reports of worse performance with HT on, but not
under Linux, which has different CPU affinity code than Windows. I very
carefully said "different" rather than "better," before anyone loses
bladder control.
Quote:


The question is one of benefit, typically on a task such as Linux kernel
build, or a large application build, SMT reduces the clock time by
20-30%.


Maybe, but that has nothing to do with "idle" EUs.

Would you like "idle resources" better?
Quote:


No matter how you spin it, if the task runs in less time I call
the CPU faster.


If? Ok, but it's not for the reasons you presume, nor is it every set of
tasks that will benefit. If your set of tasks is HT friendly, go fer it!
To presume others are similarly enhanced is simply nuts.

Absolutely. And there was a recent post saying that SETI@home ran 8000
sec/unit uni and 11000 sec/2 units with HT. I haven't seen that, but
clearly there is a huge variation in benefit, and I hope nothing I
posted implied otherwise.
Quote:


No, it doesn't help a single threaded task, other than moving the O/S
and display usage into the "other" thread.


Neither does SMP, but my position is that any application that will
benefit from SMT will benefit more from SMP. SMP is here.

SMP has always been here; I have a dual P5-MMX board "still going"
because it does the job. I had dual PPro and Xeon boards, and the one
thing they had in common was they co$t a lot. The motherboards were
larger (server targeted), had more sockets, different chipsets, more
traces, and were priced as if the buyer were building a big server and
wouldn't cringe at another $100 or so. Until the advent of dual-core
chips they were only cost effective in limited applications,
particularly since Windows didn't use them, NT was more expensive, SCO
wanted big per-CPU bucks, and Linux didn't support them in any useful
way until the mid-90's.

HT was the cheap way to get the largest benefit of SMP, reduced context
switches and much of the O/S code being done while the user code was
running. And clearly some IPC is far more effective when threads share
cache and don't have to go to main memory, do bus snooping, etc, etc.

Frankly if/when I build a dual-core Intel system, I'll probably spend
the money to get the chip with the big cache and HT enabled. The price
difference is minimal, and I would love to run performance tests against
the two-chip HT Xeons I use.

--
bill davidsen
SBC/Prodigy Yorktown Heights NY data center
http://newsgroups.news.prodigy.com

Jeremy Linton
Guest

Posted: Thu Nov 03, 2005 5:15 pm    Post subject: Re: AMD to leave x86 behind?

Eric P. wrote:
Quote:
The current x86 appears to require:
- 5 writes to save old SS, ESP, CS, EIP, EFLAGS
- read interrupt gate from IDT to get new EIP & DPL
- 3 reads for kernel CS, SS and ESP from task TSS for that DPL
- read CS & SS descriptors from GDT/LDT (assuming it cached the
GDT&LDT bases when the TSS was set, otherwise read them too)
- Interrupt routine must immediately save old and load segment registers
appropriate for kernel mode (DS, ES, FS, GS) and therefore their
appropriate descriptors.
- So I'm guessing at least 9 writes, 9 reads, and a dozen instructions
before saving any general registers.
- All of the above have a plethora of segment, validity and access
checks (the pseudo code for IRET is quite lengthy).

Now assuming one could toss the entire current design, an interrupt
should take 3 writes (old EFLAGS, EIP, ESP) to push onto the kernel
stack, and 1 read (the new EIP). The kernel stack pointer and eflags
can be stored in protected registers and don't need a memory access to load.
Segment registers are only used by 32 bit legacy mode apps and
so need not be touched by interrupt handlers, only by task switcher.
First, the sequential reads/writes are basically free after the first
one, given proper alignment, since the cache/TLB is warmed and the
memory subsystem is then in burst mode, so it's not that bad.
Secondly the x86-64 arch has segment registers, but they are basically
ignored (AMD arch reference pg 82 "In 64-bit mode, segmentation is
disabled.") with limitations for cs,fs,gs , since they have fixed
base/length, so I assume that all the updating or reading the segment
registers on interrupt is not an issue.
The "1 read (the new EIP)" must be assuming that the modified IDT is
stored somewhere in a fixed real location. That is something I'm not sure
would be a good idea; it looks more like an ugly hack. Which brings you
back to reading the appropriate virtual base address + interrupt number
offset, and the ugliness that entails.
Lastly, the CPL vs DPL distinction is massively useful, parts of which
remain in the x86-64 stuff, and shave a lot of cycles off real code
responsible for modifying user address ranges from the kernel. Doing
away with that mechanism for an "interrupt mode" adds yet another
asymmetrical mode, again probably not a good idea.

This all brings me back to my knowledge of a major PPC OS's interrupt
handler. The PPC basically takes the interrupt by flipping to supervisor
mode, jumping to the physical interrupt vector and stuffing a few
important registers in SPRs. It's pretty fast on the hardware side, but
then the OS turns around and spends a few thousand cycles setting up the
appropriate environment to get back into a usable "C" model and jumping
to the correct interrupt vector before reenabling interrupts. The end
result was painful interrupt latencies which didn't compare favorably to
major x86 OSes.


Quote:

With a redesigned MMU then even much of this could be avoided
(bottom up translate, lockable tlb entries, large variable sized
pages mapping whole kernel).
BTW: Current kernels (both NT and Linux) use 4M pages for the base
kernel information, and the 4M and 4K TLB entries are separate in
hardware (on the couple of x86s I'm sure about), thereby getting around
the need for "lockable entries", since everything else in the system is
in the 4K TLBs.


Quote:
It sounds like you are referring to ARMs bank switched register set.
Those are fine if you don't have many interrupt priority levels.
But for a general purpose cpu and OS I can see 8 or more levels:

- Bus error
- Clock
- Power fail
- Inter Processor Interrupt
- External device High
- External device Medium
- External device Low
- Scheduler

which would require too many register sets. At that point saving
the prior state on a kernel stack seems the best approach.
BTW: It is not unusual to have a number of physical states chained to a
single interrupt handler on any arch. The ARM has 6 different register
banking modes for interrupts, exceptions and system calls. This isn't as
bad as it sounds because 4 of them only bank 3 registers (13, 14, SPSR)
and FIQ bank switches (8-14 and SPSR). The first 4 banked modes aren't
that different from any other RISC CPU. Only FIQ is weird.

Normal interrupt mode is then responsible for determining what caused
the interrupt and making the appropriate jump. The ability to prioritize
interrupts is a function of the interrupt controller or the interrupt
handler. The bank switching is really only useful for FIQ, which is sort
of reserved for extremely low latency applications. I've never needed
it, instead using the FIQ pin for alternate IO or whatever else was
mux'ed onto the external FIQ pin.

Anyway, the banking can be quite annoying sometimes; moving the "saved"
registers to SPRs would have been nicer, since reading the saved
registers wouldn't require mode switches. For example, when I'm taking a
timer interrupt to deschedule a task and schedule another task, in
order to save the old task's state, I have this ugly piece of code which
saves R0-R12 while only using R13, R14, then switches to system mode,
saves the original R13, R14 and CPSR for the current task, then switches
to supervisor mode, does some cleanup to enable interrupts, and runs
some complicated priority based scheduling algorithm in C++. When that
finishes, it flips back to system mode, loads the new register
set, flips back to interrupt mode, and reverses the register saves to
restore the machine state for the new task, then finally completes the
original timer interrupt to flip back to user mode. The whole thing
seems about 10x as complicated as it needs to be, just to do a simple
context switch.

Eric P.
Guest

Posted: Thu Nov 03, 2005 5:15 pm    Post subject: Re: AMD to leave x86 behind?

MitchAlsup@aol.com wrote:
Quote:

Yousuf Khan wrote:
Now one question arises from this. How do the
Opterons themselves know how to handle the message passing between
themselves? Do they just broadcast out in every direction and hope it
gets there with the smallest route, and ignore any duplicates coming in
later from non-optimal routes? Or is there a lookup table that gets
programmed into each CPU telling it which direction to send each message?

Yousuf Khan

As memory requests enter the NorthBridge, the physical address is
converted into a 6-bit NodeID through a boot-programmed table. Target
NodeID is used to route to the service unit and Requestor NodeID is
used to route back to the requestor.

When a memory request arrives at a memory controller, it is inserted
into a queue. When there are no other earlier requests to a given cache
line in the queue, the request will be forwarded into the DRAM
controller for processing and, simultaneously, the PROBE will be
inserted into the HyperTransport fabric for routing to the coherent
caches/CPUs. Therefore, the memory controller determines memory order.
The PROBE responses and the DRAM data meet back at the originating node
and then a target done is sent to the memory controller to enable the
next access to that line. Therefore the requestor determines when a
transaction completes.

Presumably these Probes are checking the peer node caches,
and if it hits the probe response returns the data.
Does the memory controller also monitor the Probe replies
and abort the DRAM access cycle if hit?
Wouldn't this cause a lot of unnecessary DRAM page closes and
opens and thereby possibly even slow down subsequent accesses?

BTW Where is this documented? I have been through all the HT docs
I could find on ht.org and amd.com, and none describes the
coherency protocol.

Thanks
Eric

Anton Ertl
Guest

Posted: Thu Nov 03, 2005 10:50 pm    Post subject: Re: AMD to leave x86 behind?

Bill Davidsen <davidsen@deathstar.prodigy.com> writes:
Quote:
keith wrote:
Does Intel's HT issue from different threads on the same cycle? I thought
not, which makes the number of execution units moot. Certainly there are
units idle in any given cycle, that's where dynamic power management comes
into play.

I hope someone with that information contributes a factual answer,

The Willamette/Northwood trace cache dispatches 6 microinstructions
every two cycles. AFAIK with HT it dispatches alternately from one
thread and from the other. Anyway, once the microinstructions are in
the out-of-order execution core, they are issued to the execution
units possibly out-of-order, and of course microinstructions from
different threads can be issued and executed in the same cycle. That's
the point of SMT.

Quote:
In practice I see 30% gain on the best case things and 3-5% on the
worst. I have seen reports of worse performance with HT on, but not
under Linux, which has different CPU affinity code than Windows.

CPU affinity should not play a role with HT on a single core.

If you lose with SMT, I expect that comes from thrashing the L1
caches. The Willamette/Northwood L1 caches are not very big.

Moreover, they have other funny things going on, like replays (AFAIK
to recover from mis-scheduling), and I don't know how that interacts
with SMT.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html

All times are GMT