Memory controller state of the art?
Guest






Posted: Wed Nov 09, 2005 1:15 am    Post subject: Re: Memory controller state of the art? Reply with quote

Dysthymicdolt@aol.com wrote:
Quote:
MitchAlsup@aol.com wrote:
All of your musings in the first two paragraphs are implemented in the
Opteron On-die memory controllers.

Is there freely available documentation on such?

No.

I was not aware that
any processor implemented eager writeback or variable quality
prefetch operations. I am also surprised that Opteron's memory
controller would be so aggressive as to insert requests into the
MC pipeline before the L2 cache tags have been checked (it seems
like a lot of complexity for a few percent improvement in memory
latency and that only for node-local memory).

The Address path from the L2 cache to the MC is such that we insert the
request to DRAM into that pipe (simultaneously with L2 pipe) and kill
it off if the L2 takes a 'hit'.
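
A minimal sketch of the issue-then-kill idea described above. It is only an illustration: the names (MemRequest, issue_load, drain) are hypothetical, the L2 lookup is modelled as an instantaneous set membership test, and nothing here reflects the real Opteron pipeline timing.

```python
from dataclasses import dataclass


@dataclass
class MemRequest:
    addr: int
    killed: bool = False


def issue_load(addr, l2_tags, mc_pipe):
    """Send the request toward DRAM and check the L2 at the same time."""
    req = MemRequest(addr)
    mc_pipe.append(req)        # speculative insert into the MC-bound pipe
    if addr in l2_tags:        # in hardware this result arrives some cycles later
        req.killed = True      # L2 hit: cancel the in-flight DRAM request
        return "L2"
    return "DRAM"


def drain(mc_pipe):
    """Addresses that actually reach DRAM; killed requests are dropped."""
    return [r.addr for r in mc_pipe if not r.killed]


l2_tags = {0x1000, 0x2000}
mc_pipe = []
sources = [issue_load(a, l2_tags, mc_pipe) for a in (0x1000, 0x3000, 0x2000, 0x4000)]
dram_addrs = drain(mc_pipe)
```

Every load enters the MC pipe, but only the two L2 misses survive to DRAM.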

Depending upon which Opteron you are talking about, the MC itself can
invent prefetch requests prior to the CPU requesting data (or prefetch
activity) and retain the data in a local buffer.

Quote:
Paul A. Clayton


Del Cecchi wrote:
Quote:
My guess is that the target market for FB-DIMM and associated processors
isn't the 150 dollar desktop where we are now. How much memory is
supported by the AMD processors? (I could go look it up but it's late
and I'm getting tired).

Right. FB-DIMM is for systems that max out their DIMM count on each
motherboard and not for those who don't need maximum memory. Servers,
HPC, and many other apps need loads of memory. FB-DIMM is a big win for
them. Desktops, laptops, do not need FB-DIMM, at least not during the
first generation of FB-DIMM.

The number of systems that don't fit into this memory size constraint,
right now, represents just over 1 afternoon in the FAB per year.

Quote:
How does one attach many GB of memory to a processor without some
buffering scheme?

One can configure the DRAM frequency and timing (delay) to allow one to
put 16 GBytes on a double wide Opteron motherboard (8 DIMMs), or 8
GBytes on a single wide Opteron.

Per processor; so if you have a 4P multi using double wide Optis, you
can put 64GBytes into the system without having to jump into FB-DIMM.
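
The capacity arithmetic above, written out as a small sketch. The 2 GB per DIMM figure is an inference from 16 GBytes over 8 DIMMs, and the function name capacity_gb is made up for illustration.

```python
def capacity_gb(sockets, dimms_per_socket, gb_per_dimm):
    """Total DRAM capacity when every DIMM slot is populated."""
    return sockets * dimms_per_socket * gb_per_dimm


# Assumed 2 GB DIMMs, inferred from 16 GBytes across 8 DIMMs.
per_socket = capacity_gb(sockets=1, dimms_per_socket=8, gb_per_dimm=2)
four_way = capacity_gb(sockets=4, dimms_per_socket=8, gb_per_dimm=2)
```

With these inputs per_socket is 16 and four_way is 64, matching the figures quoted.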

Quote:
del

David Kanter wrote:
Quote:
All of your musings in the first two paragraphs are implemented in the
Opteron On-die memory controllers.

But I think you are way overestimating the utility of FB-DIMM
technology. I see FB-DIMMs as a size play (4X) at the cost of latency
(+20%).

The latency cost is not that big. 10ns for an unloaded memory
subsystem, but it has latency advantages when moderately or heavily
loaded.

10ns addition to a typical read access time of 60ns is 16% right off
the top (uniprocessor), the overhead goes down a little in a multi as
most transactions are Probe bound not data delivery bound.
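
A quick check of the overhead figure, as a sketch. The 60ns and 10ns values are the ones quoted above. The 110ns base is only an assumed example of a longer access (the kind of number seen in multi-socket systems), used to show that the same 10ns is a smaller fraction of a longer access.

```python
def added_latency_fraction(base_ns, extra_ns):
    """Extra latency expressed as a fraction of the base access time."""
    return extra_ns / base_ns


uni = added_latency_fraction(base_ns=60.0, extra_ns=10.0)
longer = added_latency_fraction(base_ns=110.0, extra_ns=10.0)  # assumed base
```

Here uni comes out at one sixth (roughly 16 to 17%), and longer at about 9%.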

But then other timing irregularities come into play when one considers
larger units of time. Write data starts to interfere with read address
and command beats on FB-DIMM outbound path, and the DRC has to schedule
both the outgoing and inbound DRAM data beats to prevent collisions. It
all adds up.

Our simulators indicate that unless you need the capacity, most remain
better served by more conventional DDR-DRAM in both performance and
cost parameters. Your mileage may vary.

Quote:
David

Iain McClatchie wrote:
Quote:
Mitch> All of your musings in the first two paragraphs are implemented
Mitch> in the Opteron On-die memory controllers.

Each Opteron chip knows the page open status of the remote Opterons?

No Opteron processor knows if the memory controller even has DRAM pages
let alone the status thereof or even if the memory is local or remote.
Remote or local is not different (other than fast bypass paths).

The DRC does understand DRAM architecture and utilizes rather exotic
algorithms to attempt to decrease the average access time by leveraging
these architectural parameters.

Mitch
Back to top
Anton Ertl
Guest





Posted: Wed Nov 09, 2005 9:15 am    Post subject: Re: Memory controller state of the art? Reply with quote

MitchAlsup@aol.com writes:
Quote:
One can configure the DRAM frequency and timing (delay) to allow one to
put 16 GBytes on a double wide Opteron motherboard (8 DIMMs),

And IIRC the Opteron actually supports 4GB DIMMs, and it looks like
such DIMMs are available, so that would mean 32GB per CPU socket. Not
sure if the 4GB DIMMs are officially supported, though.

Quote:
or 8
GBytes on a single wide Opteron.

Do you mean Socket 754 chips (none of which are sold as Opterons)?
These only support non-registered DIMMs, so the available DIMMs are
smaller (are non-reg 2GB DIMMs available already?), and I doubt that
there are Socket 754 boards with more than three DIMM slots around;
are more than three DIMMs actually supported for non-registered DIMMs?

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Back to top
Iain McClatchie
Guest





Posted: Thu Nov 10, 2005 12:49 am    Post subject: Re: Memory controller state of the art? Reply with quote

Ooooh, grist for the mill from Mitch. Thanks, Mitch.

Mitch> The Address path from the L2 cache to the MC is such that we
Mitch> insert the request to DRAM into that pipe (simultaneously with
Mitch> L2 pipe) and kill it off if the L2 takes a 'hit'.

I see. So by the time you get the "L2 miss" signal, you know which
bank to go to, and whether it needs a RAS first. And maybe if there's
not been a lot of L1 misses lately and no RAS needed you fire a CAS
before the "L2 miss" signal.

Quote:
10ns addition to a typical read access time of 60ns is 16% right off
the top (uniprocessor), the overhead goes down a little in a multi as
most transactions are Probe bound not data delivery bound.

Seems like you need a bigger graduation queue, so you can plow
ahead once you have the data, and restart the pipe if you get back
newer data. I bet you can predict whether an access is going to
hit in someone else's cache really well, and so get some of the
advantage of a directory protocol without the extra RAM with funky
addressing patterns.

If I understand correctly, you can scale up the graduation queue
without making the integer register files bigger.

Mitch> No Opteron processor knows if the memory controller even
Mitch> has DRAM pages let alone the status thereof or even if the
Mitch> memory is local or remote.
Mitch> Remote or local is not different (other than fast bypass paths).

I see. The L2 miss->DRAM fetch path is not speculative, and goes
through the crossbar. The L1 miss feeds speculative information to
the local DRAM controller but not to any other.

I assume that means both cores in a dual-core feed L1 miss info
to the memory controller. That's definitely information that a 2
socket single-core-per-socket system doesn't get.

Are there any published benchmarks (aside from what Tom's
Hardware just put out) comparing an X2 Athlon system to two
single-core Opterons, when running independent jobs? I ask
because we're going to buy a few dozen of these things soon,
and won't have time to benchmark them ourselves.

Re: FBDIMM. Seems like AMD doesn't need these anywhere near
as badly as Intel. If there are enough customers who really need
the gigantic memories, it seems an enterprising startup could fab
a socket 940 memory controller, and plop it into boards with 8
sockets. Voila, 8 x 8 x 4GB = 256GB. Since the memory
controllers don't have caches, they wouldn't need to be probed,
although Opteron might not understand that. And since they don't
have cores, they don't need AMD's fast silicon.
Back to top
Anton Ertl
Guest





Posted: Thu Nov 10, 2005 1:07 am    Post subject: Re: Memory controller state of the art? Reply with quote

"Iain McClatchie" <iain-3@truecircuits.com> writes:
Quote:
Are there any published benchmarks (aside from what Tom's
Hardware just put out) comparing an X2 Athlon system to two
single-core Opterons, when running independent jobs? I ask
because we're going to buy a few dozen of these things soon,
and won't have time to benchmark them ourselves.

If you want 4 cores, or bigger memory, or higher memory bandwidth than
a Socket 939 can support, go for the two-socket systems. If not, go
for a dual-core Socket 939 system.

Quote:
Re: FBDIMM. Seems like AMD doesn't need these anywhere near
as badly as Intel. If there are enough customers who really need
the gigantic memories, it seems an enterprising startup could fab
a socket 940 memory controller, and plop it into boards with 8
sockets. Voila, 8 x 8 x 4GB = 256GB. Since the memory
controllers don't have caches, they wouldn't need to be probed,
although Opteron might not understand that. And since they don't
have cores, they don't need AMD's fast silicon.

Given the cost of the board, and of the memory, buying the cheapest
Opterons 8xx instead of some special memory controllers is probably a
relatively small additional cost, and it gives the purchaser nice
bragging rights. Therefore, I don't see a big market for such a
memory controller, and it would have to be substantially cheaper than
the cheapest Opteron 8xx, so no big margins, either.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Back to top
Greg Lindahl
Guest





Posted: Thu Nov 10, 2005 1:15 am    Post subject: Re: Latency for FBD v. DDR1/2 Reply with quote

In article <1131580440.903246.218830@g49g2000cwa.googlegroups.com>,
David Kanter <dkanter@gmail.com> wrote:

Quote:
AMD and current Intel systems substantially lower the memory speed
in order to get high capacity configurations.

Although you can use expensive memory to work around this -- the
"Emerald" cluster at AMD, a 576-core InfiniPath interconnect
benchmarking machine, uses "single rank" modules from Samsung, which
allow 4 gbyte/processor without any slowdown. We really like it.

-- greg
(working for, not speaking for, PathScale.)
Back to top
Andy Glew
Guest





Posted: Thu Nov 10, 2005 1:15 am    Post subject: Re: Memory controller state of the art? Reply with quote

Quote:
I presume that accesses to remote memories are not started
speculatively.

I mean, throttling of speculative accesses to remote memories may be
of interest in big systems, bandwidth limited, etc.

But not to small systems. Not to systems that are not bandwidth
limited.

I've long wanted to enable throttling of speculative accesses, but I
doubt very much that in anything except a throughput processor do you
want such throttling to be the default policy.

(Anecdote: P6 would have had a speculative throttling mechanism,
except for the fact that at one time there was supposed to be a P6
chip on a P5 bus and chipset. Hence we created the MTRRs instead.
The MTRRs are a kluge. Speculative throttling would have been
better.)
Back to top
Andy Glew
Guest





Posted: Thu Nov 10, 2005 1:15 am    Post subject: Re: Memory controller state of the art? Reply with quote

"Iain McClatchie" <iain-3@truecircuits.com> writes:

Quote:
I presume that accesses to remote memories are not started
speculatively.

Do you mean speculatively, as in out-of-order branch prediction
speculatively?

Why do you think that Opteron, or any mass market chip, would not
initiate remote accesses speculatively?
Back to top
David Kanter
Guest





Posted: Thu Nov 10, 2005 1:15 am    Post subject: Latency for FBD v. DDR1/2 Reply with quote

Quote:
Re: FBDIMM. Seems like AMD doesn't need these anywhere near
as badly as Intel. If there are enough customers who really need
the gigantic memories, it seems an enterprising startup could fab
a socket 940 memory controller, and plop it into boards with 8
sockets. Voila, 8 x 8 x 4GB = 256GB. Since the memory
controllers don't have caches, they wouldn't need to be probed,
although Opteron might not understand that. And since they don't
have cores, they don't need AMD's fast silicon.

I'm afraid I see quite a few drawbacks to that. Principally, think
about what speed the memory will be running at. AMD and current Intel
systems substantially lower the memory speed in order to get high
capacity configurations. The largest HP configurations require you to
reduce the speed substantially: 128GB @ PC2100, 64GB @ PC2700, 32GB @
PC3200. The Sun systems are much the same way.
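
To put the trade-off in numbers, here is a small sketch that reads the PC ratings as nominal peak module bandwidth in MB/s (PC2100 about 2.1 GB/s, PC3200 about 3.2 GB/s). The capacity and speed pairs are the ones listed above; the variable names are illustrative only.

```python
# (capacity in GB, nominal peak module bandwidth in MB/s)
configs = [(128, 2100), (64, 2700), (32, 3200)]

fastest = max(mbps for _, mbps in configs)

# Peak bandwidth of each configuration relative to the fastest one.
relative = {cap: mbps / fastest for cap, mbps in configs}
```

On these nominal ratings the 128GB configuration runs at about 66% of the peak bandwidth of the 32GB one, and the 64GB configuration at about 84%.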

POWER5 can have up to 512GB of DDR2, but can only reach 2TB
configurations using DDR1, look at their TPC or SPEC submissions and
disclosures.

FBD will have better capacity at better latency and bandwidth. Intel
might have fumbled quite a few things these last years, but FBD is
certainly not one of them. Generally, when Intel makes a move nobody
likes it is quite public; think about Rambus, for example, the DRAM
manufacturers fought that tooth and nail. I have yet to hear any
companies objecting to FBD and I've even heard faint praise from
competitors.

DK
Back to top
Guest






Posted: Thu Nov 10, 2005 1:15 am    Post subject: Re: Memory controller state of the art? Reply with quote

Iain McClatchie wrote:
Quote:
Mitch> The Address path from the L2 cache to the MC is such that we
Mitch> insert the request to DRAM into that pipe (simultaneously with
Mitch> L2 pipe) and kill it off if the L2 takes a 'hit'.

I see. So by the time you get the "L2 miss" signal, you know which
bank to go to, and whether it needs a RAS first. And maybe if there's
not been a lot of L1 misses lately and no RAS needed you fire a CAS
before the "L2 miss" signal.

Not really. In Opteron, at the point of L2-hit we are still wandering
through the X-bar and have a slight chance of just having arrived at
the MC.

Quote:
I assume that means both cores in a dual-core feed L1 miss info
to the memory controller.

The CPU L1s interface through a BIU to the L2 and SRI in parallel.
Both BIUs interface through the NorthBridge{ SRI ->X-Bar->MC }.

Quote:
Are there any published benchmarks (aside from what Tom's
Hardware just put out) comparing an X2 Athlon system to two
single-core Opterons, when running independent jobs? I ask
because we're going to buy a few dozen of these things soon,
and won't have time to benchmark them ourselves.

In the dual core (2p systems) there is a delicate balance between
having 60ns average access to memory and a single MC+DRC when compared
to a dual chip with 110ns average memory access and two independent
(MC+DRC)s. Latency bound applications prefer the former (dual core)
while throughput bound applications and footprint bound applications
prefer the latter. Tom's benchmarks confirm this.

Quote:
Re: FBDIMM. Seems like AMD doesn't need these anywhere near
as badly as Intel.

Right.

Mitch
Back to top
Peter \"Firefly\" Lund
Guest





Posted: Thu Nov 10, 2005 9:15 am    Post subject: Re: Memory controller state of the art? Reply with quote

On Thu, 9 Nov 2005, Andy Glew wrote:

Quote:
(Anecdote: P6 would have had a speculative throttling mechanism,
except for the fact that at one time there was supposed to be a P6
chip on a P5 bus and chipset. Hence we created the MTRRs instead.
The MTRRs are a kluge. Speculative throttling would have been
better.)

Ok, this is a chance to learn! :)

How would you have handled frame buffers without the MTRRs?
How would you have made memory-mapped I/O on the PCI bus uncacheable?

-Peter
Back to top
Kai Harrekilde-Petersen
Guest





Posted: Thu Nov 10, 2005 3:48 pm    Post subject: Re: Memory controller state of the art? Reply with quote

Andy Glew <andy.glew@intel.com> writes:

Quote:
I presume that accesses to remote memories are not started
speculatively.

I mean, throttling of speculative accesses to remote memories may be
of interest in big systems, bandwidth limited, etc.

But not to small systems. Not to systems that are not bandwidth
limited.

I've long wanted to enable throttling of speculative accesses, but I
doubt very much that in anything except a throughput processor do you
want such throttling to be the default policy.

(newbie alert, I don't follow uP architecture, so I could be terribly
wrong)

I see two immediate reasons for limiting speculative accesses to
memories: A) those accesses that are not correctly speculated are a
waste of power, and B) the speculative accesses might "steal"
bandwidth from non-speculative accesses.

From my point of view, you'd need either a mechanism to accept that
some of your speculative accesses simply got "killed" (ie no
response/data is returned), or that you return a "failed" response
(could be either a status flag or deliberately bogus data. I'd go for
the first option).

If you were to implement throttling of speculative accesses, how would
you do it?


Kai, curious and willing to learn
--
Kai Harrekilde-Petersen <khp(at)harrekilde(dot)dk>
Back to top
Guest






Posted: Fri Nov 11, 2005 12:16 am    Post subject: Re: Memory controller state of the art? Reply with quote

David Wang wrote:
Quote:
Dysthymicdolt@aol.com wrote:
[snip]
With something that's in the hundreds of MB and small blocks, I
don't know how you'd build a tag structure to manage such a cache.

Direct-mapped cache would not be a problem (other than the long
delay to determine a miss and initiate a memory fetch--and, of
course, the problems with such low associativity); a two-way set
associative cache with on-processor-chip way prediction would
probably not be problematic (MRU is probably not an especially
good predictor at higher associativities, correct?).
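
A toy model of the two-way set-associative option with MRU way prediction, offered as a sketch under simplifying assumptions: block addresses are plain integers, the non-MRU way is always the victim, and the class name TwoWayCache is made up. It only shows when the predicted way is right, wrong, or a miss, not any real timing.

```python
class TwoWayCache:
    def __init__(self, n_sets):
        self.n_sets = n_sets
        self.tags = [[None, None] for _ in range(n_sets)]
        self.mru = [0] * n_sets  # predicted way = most recently used way

    def access(self, block):
        s, tag = block % self.n_sets, block // self.n_sets
        pred = self.mru[s]
        if self.tags[s][pred] == tag:
            return "hit_predicted"     # only the predicted way was probed
        other = 1 - pred
        if self.tags[s][other] == tag:
            self.mru[s] = other        # second probe needed, update predictor
            return "hit_mispredicted"
        self.tags[s][other] = tag      # miss: replace the non-MRU way
        self.mru[s] = other
        return "miss"


cache = TwoWayCache(n_sets=4)
trace = [cache.access(b) for b in (0, 4, 4, 0, 0)]
```

Blocks 0 and 4 map to the same set, so the trace exercises both ways: two misses, then a predicted hit, a mispredicted hit, and a predicted hit.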

I favor a OS-managed memory.

Quote:
I almost think that we could keep the region below 4GB as local
memory and keep all the FBD stuff above 4GB. Or somehow treat
this memory as the 0th hop FBD memory. I need to go talk to some
OS guys and see what the implications are....

Some things are more obvious candidates for low latency
memory. (I am guessing that modern OSes are well-tuned,
e.g., using structure splitting to separate hot data from
cold data where appropriate, and might adapt fairly quickly
to a two-level explicit memory hierarchy. Placing file cache
in higher latency memory might be sensible, especially for
prefetch and write buffers? Hot code might be a good
candidate for low latency memory. Page tables might
also be appropriate. Application developers might be able
to easily distinguish between streaming data and pointer-rich
data; this would seem to work well with slab allocation. For
*n?x, providing a MAP_LOW_LATENCY hint flag for mmap
might be sufficient for applications to exploit the different
memories.)
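
A rough sketch of how such placement hints might be applied. Everything here is hypothetical: MAP_LOW_LATENCY does not exist in any real mmap interface, and the category names and the fallback rule are assumptions made for illustration.

```python
LOW_LATENCY, HIGH_CAPACITY = "low_latency", "high_capacity"

# Categories taken from the guesses above; the mapping itself is illustrative.
PLACEMENT = {
    "hot_code": LOW_LATENCY,
    "page_tables": LOW_LATENCY,
    "pointer_rich_data": LOW_LATENCY,
    "file_cache": HIGH_CAPACITY,
    "streaming_data": HIGH_CAPACITY,
}


def place(kind, low_latency_free_pages):
    """Pick a memory pool; the preference is a hint, not a guarantee."""
    want = PLACEMENT.get(kind, HIGH_CAPACITY)
    if want == LOW_LATENCY and low_latency_free_pages == 0:
        return HIGH_CAPACITY   # fall back when the fast pool is exhausted
    return want
```

Treating the preference as a hint keeps allocation from failing when the small, fast pool fills up.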

Quote:
I *think* that proposing a new DRAM device just to get this done
is a non-starter. Hard to predict when you might battle to get
close to price parity with the commodity DRAM stuff, then the bottom
drops out of the DRAM market (again), and you find yourself with
a 5X multiple in price against the commodity stuff (again). :)

I suspect you are right. However, would it be possible to place a
buffer chip on the same board as more conventional DRAMs and
the memory controller such that the signal lines would be shared,
using chip select to have the DRAM ignore signals meant for the
buffer chip? (Such might not be electrically feasible; it would
certainly require much greater complexity in the memory controller,
which would have to arbitrate and schedule bus utilization.) Would
such even allow the use of non-differential signaling to the buffer
chip while maintaining high per-pin bandwidth? Would such be
made easier/faster with separate IO DRAMs (the DRAM command
lines are already unidirectional and presumably replicated, so such
might not be desirable)? Could SIO DRAMs be manufactured at
the same die cost as conventional DRAMs and packaged as Common
IO or SIO (allowing such to be low risk and low cost)?


Paul A. Clayton
not an EE or even a CS
Back to top
Guest






Posted: Fri Nov 11, 2005 12:29 am    Post subject: Re: Memory controller state of the art? Reply with quote

MitchAlsup@aol.com wrote:
Quote:
Not really. In Opteron, at the point of L2-hit we are still wandering
through the X-bar and have a slight chance of just having arrived at
the MC.

Okay, so it is not as aggressive an optimization as I thought it
might be. Still, it is a very nice optimization! I wonder what the
memory systems are like for POWER5 and UltraSparc IV.
Back to top
Andy Glew
Guest





Posted: Fri Nov 11, 2005 12:41 am    Post subject: Re: Memory controller state of the art? Reply with quote

"Peter \"Firefly\" Lund" <firefly@diku.dk> writes:

Quote:
On Thu, 9 Nov 2005, Andy Glew wrote:

(Anecdote: P6 would have had a speculative throttling mechanism,
except for the fact that at one time there was supposed to be a P6
chip on a P5 bus and chipset. Hence we created the MTRRs instead.
The MTRRs are a kluge. Speculative throttling would have been
better.)

Ok, this is a chance to learn! :)

How would you have handled frame buffers without the MTRRs?
How would you have made memory-mapped I/O on the PCI bus uncacheable?

The original P6 design was supposed to tag requests sent out the
front-side bus as speculative. The chipset logic, e.g. that routed
requests to the PCI, was supposed to be able to respond NACK to a
request tagged speculative, indicating that it was memory mapped I/O
and should be retried when the request was no longer speculative.

To prevent excessive traffic, I wanted to cache the type of memory
returned from the chipset in the TLBs, so that successive requests to
the same page did not go through the speculative-NACK-nonspeculative
cycle.
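
A minimal sketch of the speculative-NACK scheme with the memory type remembered per page. This is a toy model, not the P6 bus protocol: the class names, the "NACK"/"DATA"/"DEFERRED" strings, and the dictionary standing in for TLB memory-type bits are all assumptions.

```python
MMIO = "mmio"


class Chipset:
    def __init__(self, mmio_pages):
        self.mmio_pages = set(mmio_pages)
        self.requests_seen = 0

    def read(self, page, speculative):
        self.requests_seen += 1
        if page in self.mmio_pages and speculative:
            return "NACK"      # memory mapped I/O: retry when nonspeculative
        return "DATA"


class Core:
    def __init__(self, chipset):
        self.chipset = chipset
        self.page_type = {}    # stands in for memory-type bits cached in the TLB

    def load(self, page, speculative):
        if speculative and self.page_type.get(page) == MMIO:
            return "DEFERRED"  # known MMIO page: no bus traffic at all
        reply = self.chipset.read(page, speculative)
        if reply == "NACK":
            self.page_type[page] = MMIO
            return "DEFERRED"
        return reply


chipset = Chipset(mmio_pages={7})
core = Core(chipset)
results = [core.load(7, True), core.load(7, True), core.load(7, False), core.load(3, True)]
```

The second speculative load to page 7 never reaches the bus, so the chipset sees three requests for four loads.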

---

At that time USWC memory did not exist - I defined it later.

However, we could have taken the same approach, of having the chipset
tell us about the memory type of a region, and then cache it in the TLB
so that subsequent accesses could have been write combined.

Still, I think that we might still have used the MTRRs. But, they
would not have had to cover all of memory, which demands a large
number of MTRRs. Instead, they would have only had to cover the USWC
regions.

Or, we could have depended on the page tables. Although the page
tables came later - Lance Hacking did not invent the PAT until after
P6 was over.

---

Flip side: the MTRRs might not have been so bad if they had allowed
the chipset logic to be simplified.

i486 and Pentium (P5) chipsets had to respond with the cacheability to
the processor on every request. Real fast. Always a timing path.

The P6 MTRRs removed the need to have the chipsets respond so quickly.
But the chipsets still needed all of the address logic, to route some
physical addresses to DRAM, some to PCI, etc.

At one time the P6 MTRRs were supposed to have several bits that would
go out onto the bus - the idea being that the MTRRs would be
programmed with the routing. Thus removing the chipset tables (except
for FSB bus masters...).
Back to top
Andy Glew
Guest





Posted: Fri Nov 11, 2005 12:49 am    Post subject: Re: Memory controller state of the art? Reply with quote

Quote:
I see two immediate reasons for limiting speculative accesses to
memories: A) those accesses that are not correctly speculated are a
waste of power, and B) the speculative accesses might "steal"
bandwidth from non-speculative accesses.

Yep.


Quote:
From my point of view, you'd need either a mechanism to accept that
some of your speculative accesses simply got "killed" (ie no
response/data is returned), or that you return a "failed" response
(could be either a status flag or deliberately bogus data. I'd go for
the first option).

If you were to implement throttling of speculative accesses, how would
you do it?

I'd go for the second option, a failed response - speculative NACK,
retry when nonspeculative.

Reason: Intel's bus designs require a response so that request buffers
can be deallocated. Simplifies design.

Timeouts require that you have (a) a timer, and (b) a mechanism to
ignore responses that return after the original request has timed out.
And (c) they also need a mechanism to ensure that such a response that
arrives after its original request has timed out and been deallocated
NOT be considered to be a response to a new request that happens to
use the same transaction ID, or be to the same address. Or prevent
such post-timeout responses from ever happening.

Doable - TCP/IP does exactly this - but it is more expensive than the
simple bus protocols.
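
One possible way to handle point (c), sketched under assumptions: each transaction ID carries a generation count, and a response is accepted only if its generation matches the currently pending request. This is a swapped-in illustration, not how any Intel bus does it, and the RequestTable name and methods are hypothetical.

```python
class RequestTable:
    def __init__(self, n_ids):
        self.gen = [0] * n_ids   # generation counter per transaction ID
        self.pending = {}        # tid -> (generation, address)

    def allocate(self, tid, addr):
        self.gen[tid] += 1
        self.pending[tid] = (self.gen[tid], addr)
        return (tid, self.gen[tid])   # tag carried by request and response

    def timeout(self, tid):
        self.pending.pop(tid, None)   # deallocate the timed-out request

    def on_response(self, tag):
        tid, gen = tag
        entry = self.pending.get(tid)
        if entry is None or entry[0] != gen:
            return None               # late response to a dead request: drop it
        del self.pending[tid]
        return entry[1]


table = RequestTable(n_ids=4)
old = table.allocate(0, 0xA0)
table.timeout(0)
new = table.allocate(0, 0xB0)         # same transaction ID reused
stale = table.on_response(old)
fresh = table.on_response(new)
```

The late response to the timed-out request is dropped even though its transaction ID has been reused, which is the extra state the simple always-respond protocol avoids.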

I don't have strong opinions here. These are not my bus protocol
designs, but I have to live with them, and I try to understand the
motivations of their designers.
Back to top
 
All times are GMT