| Author |
Message |
Jeremy Linton
Guest
|
Posted:
Tue Nov 01, 2005 12:04 am Post subject:
Re: AMD to leave x86 behind? |
|
|
Tim McCaffrey wrote:
| Quote: | Instructions that load/store/copy memory that control how much
cache pollution is done and/or communicate to the bridges and
I/O devices how much data is being loaded/stored could improve
efficiency on the I/O (PCI) and memory busses.
Ahhh.. Maybe you should look at the MTRRs (Memory Type Range
Registers), the PAT (Page Attribute Table), and the prefetch and non-temporal
instructions already provided. I've seen near-theoretical throughput
numbers both on the memory subsystem and on PCI busses given properly
tuned code. It's not that you can't control such things on x86; it's
just that I haven't seen a compiler generate optimal code.
AMD had a very nice document they wrote a few years ago about how to
get max throughput with memory copy operations, where they compared
different methods and instructions for doing the memory copy. If I
remember correctly, in the end they got nearly theoretical bandwidth
numbers by doing a simple loop to pre-read cache-block-sized chunks (with
actual register loads instead of prefetch), followed by another loop
actually doing a non-temporal quadword copy. This reduced the read vs.
write bus turnaround times enough to get numbers that were significantly
faster than nearly any other method.
So it's possible right now, given proper code. |
|
|
 |
Terje Mathisen
Guest
|
Posted:
Tue Nov 01, 2005 1:15 am Post subject:
Re: AMD to leave x86 behind? |
|
|
Jeremy Linton wrote:
| Quote: | AMD had a very nice document they wrote a few years ago about how to
get max throughput with memory copy operations, where they compared
different methods and instructions for doing the memory copy. If I
remember correctly, in the end they got nearly theoretical bandwidth
numbers by doing a simple loop to pre-read cache-block-sized chunks (with
actual register loads instead of prefetch), followed by another loop
actually doing a non-temporal quadword copy. This reduced the read vs.
write bus turnaround times enough to get numbers that were significantly
faster than nearly any other method.
|
Afair, that optimization was in regard to doing a simple set of fp
operations on a block of data, where it turned out that the fastest way
was to move everything three times:
First the max speed pre-read loop, then an operate loop, storing to a
fixed half L1 sized buffer, then finally NT stores to move the result
block to the final destination.
| Quote: |
So it's possible right now, given proper code.
|
Or in this case, quite horribly overcomplicated code. :-(
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
|
 |
Scott A Crosby
Guest
|
Posted:
Tue Nov 01, 2005 1:15 am Post subject:
Re: AMD to leave x86 behind? |
|
|
On Mon, 31 Oct 2005 18:04:01 GMT, Jeremy Linton <replytothelist@nospam.com> writes:
| Quote: | Ahhh.. Maybe you should look at the MTRRs (Memory Type Range
Registers), the PAT (Page Attribute Table), and the prefetch and
non-temporal instructions already provided. I've seen
near-theoretical throughput numbers both on the memory subsystem and
on PCI busses given properly tuned code. It's not that you
can't control such things on x86; it's just that I haven't
seen a compiler generate optimal code.
AMD had a very nice document they wrote a few years ago about
how to get max throughput with memory copy operations, where
|
Would you happen to know the URL? I'd like to read this document.
| Quote: | This reduced the read vs write bus turnaround times enough to
get numbers that were significantly faster than nearly any
other method.
|
In particular, for this memory turnaround effect you've mentioned?
Scott |
|
|
 |
Yousuf Khan
Guest
|
Posted:
Tue Nov 01, 2005 9:15 am Post subject:
Re: AMD to leave x86 behind? |
|
|
Oliver S. wrote:
| Quote: | And let's say it'll have 32 FP registers instead of just 16 like SSE
does.
Of course 32 registers would be better than 16, but I think we're well
past a critical point with 16 FP registers. I think the large register
sets we see today on newer architectures exist because they are easy to
implement in a CPU rather than because they are necessary; in other
words, the benefit of 32 or more registers isn't very high in most
cases, but their cost in terms of chip design is rather low as long as
the register file doesn't become too large.
|
Can't disagree with that.
Yousuf Khan |
|
|
 |
Yousuf Khan
Guest
|
Posted:
Tue Nov 01, 2005 9:15 am Post subject:
Re: AMD to leave x86 behind? |
|
|
David Hopwood wrote:
| Quote: | Rob Stow wrote:
In an eight-way system most
are one hop away, while a few are two hops away.
No again. This would be the ideal 8P Opty 8xx scheme:
CPU6-----------------CPU7
 | \                 / |
 |  \               /  |
 |   CPU4-------CPU5   |
 |    |           |    |
 |    |           |    |
 |   CPU2-------CPU3   |
 |  /               \  |
 | /                 \ |
CPU0                 CPU1
 |                     |
 |                     |
Chipset           Chipset
Hence, there are 11 one-hops, 12 two-hops, and 5 three-hops.
That's not optimal:
CPU6---------------CPU7
 | \_____     _____/ |
 |       \   /       |
 |         X         |
 |       /   \       |
 |    CPU4---CPU5    |
 |     |       |     |
 |     |       |     |
 |    CPU2---CPU3    |
 |   /           \   |
 | /               \ |
CPU0               CPU1
 |                   |
 |                   |
Chipset         Chipset
11 one-hops, 16 two-hops, and 1 three-hop.
|
I don't get it, your diagram seems to be only a different permutation of
Rob's diagram. The only difference, in yours is that you got CPU5
connecting to CPU6 and CPU4 to CPU7, whereas in Rob's it was CPU4 to
CPU6 & CPU5 to CPU7. That little "x" you put in between doesn't
represent a shortcut, it represents one line going over the other but
not touching.
Listing all of the 3 hop combinations in yours and Rob's, this is what I
get.
Rob:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU7: 2-4-5-7 or 2-3-5-7
5: CPU3-CPU6: 3-2-4-6 or 3-5-4-6
David:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU6: 2-4-5-6 or 2-3-5-6
5: CPU3-CPU7: 3-2-4-7 or 3-5-4-7
Only #4 & #5 are different between your two respective diagrams.
Yousuf Khan |
|
|
 |
Yousuf Khan
Guest
|
Posted:
Tue Nov 01, 2005 9:15 am Post subject:
Re: AMD to leave x86 behind? |
|
|
David Kanter wrote:
| Quote: | how about some form of SMT for AMD?
I don't know, that might come too, but it can't be done as easily as
Hyperthreading. Hyperthreading relied on the Pentium 4's inherent
inefficiency to run a lot of threads simultaneously.
If you think that any modern MPU is efficient, you are smoking crack.
They all have plenty of unused cycles left on the table (except when
running linpack).
|
But the secret is to have enough idle cycles to run both threads at
close to full speed each. I'd say anything that had enough to run both
threads at 80% of full speed was a reasonably successful SMT. |
|
|
 |
Yousuf Khan
Guest
|
Posted:
Tue Nov 01, 2005 9:15 am Post subject:
Re: AMD to leave x86 behind? |
|
|
Stephen Fuld wrote:
| Quote: | Is there some technical reason behind the limitation to three HT links or
was it a marketing decision? If the latter, then it doesn't seem like it
would be a big deal, if larger systems seems to be a bigger market, to add
another link (or even two). The HT links must be a pretty small amount of
silicon and a small number of pins. Does that make sense?
|
I don't think there was any technical or marketing reason behind
limiting it to 3 HT links per processor. It may have simply been a "we
need to keep the number of HT links and their pin counts within a
reasonable amount"-type decision. I'm sure they can add even more HT
links in the future.
Yousuf Khan |
|
|
 |
Yousuf Khan
Guest
|
Posted:
Tue Nov 01, 2005 9:15 am Post subject:
Re: AMD to leave x86 behind? |
|
|
Oliver S. wrote:
| Quote: | If it added instructions to explicitly prefetch data from another
processor then it would probably have a gain in performance.
These instructions wouldn't work better than the prefetch instructions
currently implemented. I think it would be cleverer to copy HW-scouting
from Sun's upcoming CPUs. HW-scouting is simple to implement if you're
going to have an SMT core anyway.
|
So what's HW-scouting?
Yousuf Khan |
|
|
 |
David Brown
Guest
|
Posted:
Tue Nov 01, 2005 9:15 am Post subject:
Re: AMD to leave x86 behind? |
|
|
Yousuf Khan wrote:
| Quote: | David Hopwood wrote:
Rob Stow wrote:
In an eight-way system most
are one hop away, while a few are two hops away.
No again. This would be the ideal 8P Opty 8xx scheme:
CPU6-----------------CPU7
 | \                 / |
 |  \               /  |
 |   CPU4-------CPU5   |
 |    |           |    |
 |    |           |    |
 |   CPU2-------CPU3   |
 |  /               \  |
 | /                 \ |
CPU0                 CPU1
 |                     |
 |                     |
Chipset           Chipset
Hence, there are 11 one-hops, 12 two-hops, and 5 three-hops.
That's not optimal:
CPU6---------------CPU7
 | \_____     _____/ |
 |       \   /       |
 |         X         |
 |       /   \       |
 |    CPU4---CPU5    |
 |     |       |     |
 |     |       |     |
 |    CPU2---CPU3    |
 |   /           \   |
 | /               \ |
CPU0               CPU1
 |                   |
 |                   |
Chipset         Chipset
11 one-hops, 16 two-hops, and 1 three-hop.
I don't get it, your diagram seems to be only a different permutation of
Rob's diagram. The only difference, in yours is that you got CPU5
connecting to CPU6 and CPU4 to CPU7, whereas in Rob's it was CPU4 to
CPU6 & CPU5 to CPU7. That little "x" you put in between doesn't
represent a shortcut, it represents one line going over the other but
not touching.
Listing all of the 3 hop combinations in yours and Rob's, this is what I
get.
Rob:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU7: 2-4-5-7 or 2-3-5-7
5: CPU3-CPU6: 3-2-4-6 or 3-5-4-6
David:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU6: 2-4-5-6 or 2-3-5-6
5: CPU3-CPU7: 3-2-4-7 or 3-5-4-7
Only #4 & #5 are different between your two respective diagrams.
Yousuf Khan
|
The cross-over gives short-cuts for #2 to #5 (#1, CPU0-CPU1, is still a
3-hop):
2: CPU0-CPU5: 0-6-5
3: CPU1-CPU4: 1-7-4
4: CPU2-CPU6: 2-0-6
5: CPU3-CPU7: 3-1-7
mvh.,
David |
|
|
 |
Terje Mathisen
Guest
|
Posted:
Tue Nov 01, 2005 9:15 am Post subject:
Re: AMD to leave x86 behind? |
|
|
Yousuf Khan wrote:
| Quote: | David Hopwood wrote:
That's not optimal:
CPU6---------------CPU7
 | \_____     _____/ |
 |       \   /       |
 |         X         |
 |       /   \       |
 |    CPU4---CPU5    |
 |     |       |     |
 |     |       |     |
 |    CPU2---CPU3    |
 |   /           \   |
 | /               \ |
CPU0               CPU1
 |                   |
 |                   |
Chipset         Chipset
11 one-hops, 16 two-hops, and 1 three-hop.
I don't get it, your diagram seems to be only a different permutation of
Rob's diagram. The only difference, in yours is that you got CPU5
connecting to CPU6 and CPU4 to CPU7, whereas in Rob's it was CPU4 to
CPU6 & CPU5 to CPU7. That little "x" you put in between doesn't
represent a shortcut, it represents one line going over the other but
not touching.
Listing all of the 3 hop combinations in yours and Rob's, this is what I
get.
Rob:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU7: 2-4-5-7 or 2-3-5-7
5: CPU3-CPU6: 3-2-4-6 or 3-5-4-6
David:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU6: 2-4-5-6 or 2-3-5-6
5: CPU3-CPU7: 3-2-4-7 or 3-5-4-7
Only #4 & #5 are different between your two respective diagrams.
|
I think you've missed a key feature of that cross:
2: CPU0-CPU5: 0-6-5
3: CPU1-CPU4: 1-7-4
4: CPU2-CPU6: 2-0-6
5: CPU3-CPU7: 3-1-7
I.e. only the CPU0-CPU1 link has to pass over three hops.
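The hop counts traded back and forth in this thread can be checked mechanically. The following is a minimal sketch, not anything posted by the participants: it assumes the link lists below are a faithful reading of the two ASCII diagrams (chipset links left out), and the names ROB_LINKS, CROSSED_LINKS and hop_histogram are made up for illustration. It runs a breadth-first search from every CPU and tallies how many CPU pairs are one, two, or three hops apart.

```python
from collections import deque

# Assumption: CPU-to-CPU links as read off the two diagrams in the thread.
ROB_LINKS = [(6, 7), (6, 4), (7, 5), (6, 0), (7, 1),
             (4, 5), (4, 2), (5, 3), (2, 3), (2, 0), (3, 1)]
# Same layout, except the upper diagonals cross: CPU6-CPU5 and CPU7-CPU4.
CROSSED_LINKS = [(6, 7), (6, 5), (7, 4), (6, 0), (7, 1),
                 (4, 5), (4, 2), (5, 3), (2, 3), (2, 0), (3, 1)]


def adjacency(links):
    """Build an undirected adjacency map from a list of (a, b) links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def hops_from(adj, start):
    """Breadth-first search: minimum hop count from start to every node."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist


def hop_histogram(links):
    """Map hop count -> number of unordered CPU pairs at that distance."""
    adj = adjacency(links)
    counts = {}
    for src in adj:
        for dst, d in hops_from(adj, src).items():
            if src < dst:  # count each pair once
                counts[d] = counts.get(d, 0) + 1
    return dict(sorted(counts.items()))


print("straight:", hop_histogram(ROB_LINKS))
print("crossed: ", hop_histogram(CROSSED_LINKS))
```

Under that reading of the diagrams, the straight layout gives 11 one-hops, 12 two-hops and 5 three-hops, and the crossed layout gives 11, 16 and 1, which agrees with the figures quoted earlier in the thread.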
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
|
 |
Jeremy Linton
Guest
|
Posted:
Tue Nov 01, 2005 5:15 pm Post subject:
Re: AMD to leave x86 behind? |
|
|
Scott A Crosby wrote:
| Quote: | On Mon, 31 Oct 2005 18:04:01 GMT, Jeremy Linton <replytothelist@nospam.com> writes:
Ahhh.. Maybe you should look at the MTRRs (Memory Type Range
Registers), the PAT (Page Attribute Table), and the prefetch and
non-temporal instructions already provided. I've seen
near-theoretical throughput numbers both on the memory subsystem and
on PCI busses given properly tuned code. It's not that you
can't control such things on x86; it's just that I haven't
seen a compiler generate optimal code.
AMD had a very nice document they wrote a few years ago about
how to get max throughput with memory copy operations, where
Would you happen to know the URL? I'd like to read this document.
Took me a little while, but eventually I ended up with the correct
Google incantation...
http://cdrom.amd.com/devconn/events/gdc_2002_amd.pdf page 21: "Reading a
large block, then writing a large block, causes the smallest number of
read/write mode changes in the memory module". I'm not sure how much of
their results are due to bus turnaround vs. address update savings. I'm
too lazy to look up the memory timings and do the math right now.
You should also look at
http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/25112.PDF
or the intel equivalent.
| Quote: |
This reduced the read vs write bus turnaround times enough to
get numbers that were significantly faster than nearly any
other method.
In particular, for this memory turnaround effect you've mentioned?
In general, on bidirectional busses there are sometimes timing
penalties for changing the state of the read/write line. In many cases
these get hidden in the sequence that also updates the source address
before you can start the burst cycle. Pick your favorite bus to get
specifics on random access vs. burst mode and any read vs. write penalties.
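To put a rough number on the "smallest number of read/write mode changes" point, here is a toy count, not a model of any real memory controller. It assumes a copy is issued as alternating read bursts and write bursts of a fixed number of cache lines, and that every change of direction counts as one turnaround; the function name mode_changes and the sizes used are made up for illustration. Real hardware buffers and reorders traffic, so treat the result as an upper-level intuition only.

```python
import math


def mode_changes(total_lines, block_lines):
    """Count read<->write direction switches for a blocked copy.

    The copy moves total_lines cache lines by repeatedly reading a burst of
    block_lines lines and then writing that burst out.
    """
    if total_lines <= 0 or block_lines <= 0:
        raise ValueError("sizes must be positive")
    blocks = math.ceil(total_lines / block_lines)
    # One read burst followed by one write burst per block.
    bursts = ["R", "W"] * blocks
    # Every adjacent pair of different bursts is one bus turnaround.
    return sum(1 for a, b in zip(bursts, bursts[1:]) if a != b)


# Assumption: 1024 cache lines copied, with a few illustrative block sizes.
for block in (1, 8, 64, 1024):
    print(block, mode_changes(1024, block))
```

With these assumed sizes, copying line by line switches direction 2047 times, while 64-line blocks switch 31 times, which is the effect the AMD slide describes.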
A quick google search yielded this site:
http://www.commsdesign.com/design_corner/showArticle.jhtml?articleID=16700780
Which has some very cool graphs comparing different memory technologies
and the transfer rates based on things like the read vs write ratios.
There are some other tasty graphs there too. |
|
|
 |
Eric P.
Guest
|
Posted:
Tue Nov 01, 2005 5:15 pm Post subject:
Re: AMD to leave x86 behind? |
|
|
Yousuf Khan wrote:
| Quote: |
Tim McCaffrey wrote:
Performance doesn't necessarily mean FP performance. For instance, the
x86 interrupt model really needs a re-think (especially since x86-64
doesn't really support the segmentation model). Something like
the ARM's register sets would be nice.
What specifically about the interrupt model is the problem?
|
The x86 takes an awful lot (thousands) of clocks to do what common
sense tells you shouldn't cost much more than a pipeline drain.
If you wrote down a description of the functionality required
for interrupt handling, then compared your design to how x86
actually works, you can see the appalling cost.
If you look at the x86 System Programmer's Guide you'll get some
idea why. Segmentation, RPL-DPL protection checks, call gates,
hardware tasking, all of which are not used but cost overhead.
Sadly the x64 continues to drag much of that baggage with it
because its design must straddle both worlds. Remember that
the x64 had to boot looking like an x86 for compatibility,
so it even supports real and virtual 8086 modes (just in case).
They did fix a couple of things. The CS, EIP, SS, ESP, FLAGS state
record saved on the kernel stack is at least always the same shape
and is no longer dependent on whether the prior mode was User or Super
(which meant that kernel code always had to check the prior mode
before accessing the state record).
Unfortunately I doubt that will change. It works, so move on.
Nobody is not buying PCs because of their interrupt handling.
Eric |
|
|
 |
Jeremy Linton
Guest
|
Posted:
Tue Nov 01, 2005 5:15 pm Post subject:
Re: AMD to leave x86 behind? |
|
|
Eric P. wrote:
| Quote: | Yousuf Khan wrote:
Tim McCaffrey wrote:
Performance doesn't necessarily mean FP performance. For instance, the
x86 interrupt model really needs a re-think (especially since x86-64
doesn't really support the segmentation model). Something like
the ARM's register sets would be nice.
What specifically about the interrupt model is the problem?
The x86 takes an awful lot (thousands) of clocks to do what common
sense tells you shouldn't cost much more than a pipeline drain.
If you wrote down a description of the functionality required
for interrupt handling, then compared your design to how x86
actually works, you can see the appalling cost.
If you look at the x86 System Programmer's Guide you'll get some
idea why. Segmentation, RPL-DPL protection checks, call gates,
hardware tasking, all of which are not used but cost overhead.
Sadly the x64 continues to drag much of that baggage with it
because its design must straddle both worlds. Remember that
the x64 had to boot looking like an x86 for compatibility,
so it even supports real and virtual 8086 modes (just in case).
I'm not sure how much of that is really the problem. A few years ago I
did some basic back-of-the-envelope calculations in a similar
discussion. My calculations were based on 100% cache misses for
the dozen or so memory reads required to pull in the interrupt vector
when there wasn't a TLB entry for the IDT or the destination address.
Given a few hundred cycles of latency for each read, it quickly became
apparent that, short of disabling paging and removing the ability to move
the IDT around in memory, it wasn't going to get much faster.
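That back-of-the-envelope estimate can be written out as a few lines of arithmetic. This is only a sketch: the figure of 12 reads stands in for "the dozen or so", the latencies of 200 to 400 cycles stand in for "a few hundred", and the function name is made up for illustration. It ignores everything except the serialized cache-missing reads.

```python
def interrupt_entry_cycles(memory_reads, miss_latency_cycles):
    """Cycles spent waiting on memory if every read misses all caches
    and the reads are fully serialized."""
    return memory_reads * miss_latency_cycles


READS = 12  # assumption: "the dozen or so memory reads"

# Assumption: "a few hundred cycles of latency for each read".
for latency in (200, 300, 400):
    print(latency, interrupt_entry_cycles(READS, latency))
```

With these assumed inputs the worst case lands between 2400 and 4800 cycles, the same order of magnitude as the "thousands of clocks" figure quoted above.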
Compare that with my ARM7TDMI, which doesn't have paging and has the ISR
vector fixed in memory (and thereby saves a bunch of time dealing
with memory translation). The real advantage of the ARM was the separate
register space, which allowed it to avoid having to set up a separate
stack space. Of course, my generic operating environment basically then
has to set up the stack space manually for 99% of the interrupts, so that
wasn't really an advantage in the long run. This reinforced what I
thought after working with a PPC interrupt handler: it could handle the
interrupt faster, but then you spend 10x as long manually dealing with
all the crap that the interrupt handler didn't do for you. In the end
it's a wash for general-purpose computing. On the other hand, when you're
talking about a hard real-time system with a short latency requirement,
the FIQ mode on the ARM starts to look really good. In those environments
you don't generally give a crap about 99% of the stuff a normal OS does
in an interrupt handler, so it's not really a fair comparison. |
|
|
 |
Del Cecchi
Guest
|
Posted:
Tue Nov 01, 2005 5:15 pm Post subject:
Re: AMD to leave x86 behind? |
|
|
"Yousuf Khan" <bbbl67@ezrs.com> wrote in message
news:dtJ9f.4402$LF3.395983@news20.bellglobal.com...
| Quote: | Terje Mathisen wrote:
I don't get it, your diagram seems to be only a different permutation of
Rob's diagram. The only difference, in yours is that you got CPU5
connecting to CPU6 and CPU4 to CPU7, whereas in Rob's it was CPU4 to
CPU6 & CPU5 to CPU7. That little "x" you put in between doesn't
represent a shortcut, it represents one line going over the other but
not touching.
Listing all of the 3 hop combinations in yours and Rob's, this is what I
get.
Rob:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU7: 2-4-5-7 or 2-3-5-7
5: CPU3-CPU6: 3-2-4-6 or 3-5-4-6
David:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU6: 2-4-5-6 or 2-3-5-6
5: CPU3-CPU7: 3-2-4-7 or 3-5-4-7
Only #4 & #5 are different between your two respective diagrams.
I think you've missed a key feature of that cross:
2: CPU0-CPU5: 0-6-5
3: CPU1-CPU4: 1-7-4
4: CPU2-CPU6: 2-0-6
5: CPU3-CPU7: 3-1-7
I.e. only the CPU0-CPU1 link has to pass over three hops.
Okay, I stand corrected. Now one question arises from this. How do the
Opterons themselves know how to handle the message passing between
themselves? Do they just broadcast out in every direction and hope it
gets there with the smallest route, and ignore any duplicates coming in
later from non-optimal routes? Or is there a lookup table that gets
programmed into each CPU telling it which direction to send each
message?
Yousuf Khan
|
I have never heard that the multiple HT ports include a switch or a
routing table. My guess is that a broadcast protocol is used.
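The thread does not settle which mechanism the Opteron actually uses, so the following only illustrates the "lookup table" option raised in the question. It is a sketch under stated assumptions: the link list is the crossed topology as read from the diagram, and next_hop_table is a hypothetical helper, not a description of any AMD register or firmware interface. For one CPU it computes, per destination, which directly connected neighbour lies on a shortest path.

```python
from collections import deque

# Assumption: CPU-to-CPU links of the crossed layout, read off the diagram.
CROSSED_LINKS = [(6, 7), (6, 5), (7, 4), (6, 0), (7, 1),
                 (4, 5), (4, 2), (5, 3), (2, 3), (2, 0), (3, 1)]


def adjacency(links):
    """Build an undirected adjacency map from a list of (a, b) links."""
    adj = {}
    for a, b in links:
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    return adj


def next_hop_table(adj, src):
    """Map each destination to the neighbour of src that starts a
    shortest path to it (ties broken by lowest CPU number)."""
    first_hop = {}
    seen = {src}
    queue = deque()
    for neighbour in sorted(adj[src]):
        seen.add(neighbour)
        first_hop[neighbour] = neighbour
        queue.append(neighbour)
    while queue:
        node = queue.popleft()
        for nxt in sorted(adj[node]):
            if nxt not in seen:
                seen.add(nxt)
                # Inherit the first hop of the node we arrived through.
                first_hop[nxt] = first_hop[node]
                queue.append(nxt)
    return first_hop


adj = adjacency(CROSSED_LINKS)
for cpu in sorted(adj):
    print(cpu, dict(sorted(next_hop_table(adj, cpu).items())))
```

In this sketch CPU0 sends traffic for CPU5 out of its CPU6 link and CPU1 sends traffic for CPU4 out of its CPU7 link, matching the two-hop routes listed earlier; each table has only seven entries, so such a table would be cheap to program once at boot.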
del |
|
|
 |
Rob Stow
Guest
|
Posted:
Tue Nov 01, 2005 5:15 pm Post subject:
Re: AMD to leave x86 behind? |
|
|
Yousuf Khan wrote:
| Quote: | David Hopwood wrote:
Rob Stow wrote:
In an eight-way system most
are one hop away, while a few are two hops away.
No again. This would be the ideal 8P Opty 8xx scheme:
CPU6-----------------CPU7
 | \                 / |
 |  \               /  |
 |   CPU4-------CPU5   |
 |    |           |    |
 |    |           |    |
 |   CPU2-------CPU3   |
 |  /               \  |
 | /                 \ |
CPU0                 CPU1
 |                     |
 |                     |
Chipset           Chipset
Hence, there are 11 one-hops, 12 two-hops, and 5 three-hops.
That's not optimal:
CPU6---------------CPU7
 | \_____     _____/ |
 |       \   /       |
 |         X         |
 |       /   \       |
 |    CPU4---CPU5    |
 |     |       |     |
 |     |       |     |
 |    CPU2---CPU3    |
 |   /           \   |
 | /               \ |
CPU0               CPU1
 |                   |
 |                   |
Chipset         Chipset
11 one-hops, 16 two-hops, and 1 three-hop.
I don't get it, your diagram seems to be only a different permutation of
Rob's diagram. The only difference, in yours is that you got CPU5
connecting to CPU6 and CPU4 to CPU7, whereas in Rob's it was CPU4 to
CPU6 & CPU5 to CPU7. That little "x" you put in between doesn't
represent a shortcut, it represents one line going over the other but
not touching.
Listing all of the 3 hop combinations in yours and Rob's, this is what I
get.
Rob:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
4: CPU2-CPU7: 2-4-5-7 or 2-3-5-7
5: CPU3-CPU6: 3-2-4-6 or 3-5-4-6
David:
1: CPU0-CPU1: 0-2-3-1
2: CPU0-CPU5: 0-2-3-5 or 0-2-4-5
|
Two hop route = 0-6-5
| Quote: | 3: CPU1-CPU4: 1-3-2-4 or 1-3-5-4
|
1-7-4
| Quote: | 4: CPU2-CPU6: 2-4-5-6 or 2-3-5-6
|
2-0-6
| Quote: | 5: CPU3-CPU7: 3-2-4-7 or 3-5-4-7
|
3-1-7
| Quote: | Only #4 & #5 are different between your two respective diagrams.
Yousuf Khan
|
I knew I'd gotten something wrong when I had so many 3-hops, but
unfortunately I'd already hit "send" when that thought hit me. I
was too focused on someone else's statement that "most are one
hop away, while a few are two hops away". |
|
| Back to top |
|
 |
|
|
|
|