| Author |
Message |
Guest
|
Posted:
Tue Nov 01, 2005 9:15 am Post subject:
Memory controller state of the art? |
|
|
How sophisticated are current and near future memory controllers?
ISTR reading that the Alpha 21364 (EV7) memory controllers have a
somewhat simple albeit intelligent page-open policy, but I do not
recall reading anything about more sophisticated features like
prioritizing accesses according to urgency (e.g., demand reads
having priority over bufferable writes and prefetches) or to
maximize throughput (e.g., reordering accesses to minimize the
number of page closings and openings).
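To make the two kinds of prioritization above concrete, here is a minimal sketch of a request picker that ranks by urgency first and by open-page hits second. The request classes, their numeric ranks, and the `pick_next` helper are assumptions made for illustration; nothing here describes the EV7 or any other shipping controller.

```python
from dataclasses import dataclass

# Urgency classes from the post: demand reads outrank bufferable writes
# and prefetches. The numeric ranks are illustrative assumptions.
URGENCY = {"demand_read": 0, "write": 1, "prefetch": 2}


@dataclass
class Request:
    kind: str      # one of the URGENCY keys
    bank: int
    row: int       # DRAM page (row) within the bank
    arrival: int   # arrival order, used as the final tie-breaker


def pick_next(queue, open_rows):
    """Return the request to issue next.

    open_rows maps bank -> currently open row (absent if precharged).
    Ordering: urgency class first, then open-page hits (to avoid page
    closings and openings), then oldest first.
    """
    def key(req):
        page_hit = open_rows.get(req.bank) == req.row
        return (URGENCY[req.kind], not page_hit, req.arrival)
    return min(queue, key=key)


queue = [
    Request("write", bank=0, row=7, arrival=0),
    Request("prefetch", bank=1, row=3, arrival=1),
    Request("demand_read", bank=2, row=9, arrival=2),
    Request("demand_read", bank=1, row=3, arrival=3),
]
open_rows = {0: 7, 1: 3}
chosen = pick_next(queue, open_rows)
print(chosen)
```

In this example the younger demand read is picked over the older one because it hits the open page in bank 1. A real controller would also need starvation limits and write-buffer-full handling, which this sketch leaves out.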
An integrated memory controller could also enqueue accesses that
have not yet been confirmed as L2/L3 cache misses, checking
whether the DRAM page is open and performing some prioritization
work while waiting for the access to be canceled or confirmed.
Integration also makes more aggressive prefetching and more
intelligent and aggressive eager writeback feasible because
likely-to-be-dropped prefetch requests can be sent (only wasting
less expensive on-chip bandwidth) and eager writeback can
interact with the knowledge of which DRAM pages are open and
busy. (Obviously memory attached to a remote controller would
have to be handled more conservatively, though the connection to
an adjacent memory controller might have enough surplus bandwidth
to allow the open DRAM page tags to be mirrored, allowing
potential prefetches [or eager writebacks] to adjacent memories
to be prioritized and possibly delayed.)
Providing a specialized zeroed-page cache at the memory
controller might be worthwhile (though one might also want to
flag zero-page TLB entries as well to reduce latency of cache
block zeroing on a read or a page zeroing on a write).
An FB-DIMM-like architecture could allow for further complexity
in memory access scheduling by allowing write buffers and
prefetch caches to be located and partially managed at the memory
module (managing memory controller to buffer chip and buffer chip
to DRAM [and ideally on-DRAM SRAM buffer/cache to DRAM array]
bandwidth). Placing a cache on the first and/or second FB-DIMM
could allow more aggressive partial (e.g., distant FB-DIMM to
near FB-DIMM cache) prefetching since distant FB-DIMM to near
FB-DIMM bandwidth is cheaper than first FB-DIMM to memory
controller bandwidth. (ISTR reading that initial FB-DIMM memory
systems will not take advantage of the potential NUMA character of
the architecture, which is a bit disappointing. BTW, would it be
reasonable to integrate the first buffer chips very near the
processor/memory controller [possibly the same package?] and have
the first modules share the processor board to provide higher
bandwidth from the first buffer chips and lower latency to the
first few GiB of memory?)
Paul A. Clayton
a babbling not-quite-idiot :-\ |
|
|
 |
Iain McClatchie
Guest
|
Posted:
Wed Nov 02, 2005 1:16 am Post subject:
Re: Memory controller state of the art? |
|
|
I note that if the cache line is 64B and the DDR memory channel
is 64 or 72b, then you have 8 bit times on the data pins per
transaction, and 4 bit times on the address pins per transaction. A
read keeps the bank busy for many cycles, maybe 8-10.
RAS and CAS only take 2 of those 4 address slots, so you can issue 2
more commands. If a read keeps a bank busy for many cycles, say 10,
and there are 4 banks, then if the data bus is saturated the banks
are only a little more than half busy on average (10 busy cycles out
of every 4 x 4 = 16). So we have the opportunity and motive to
kick off some speculative RAS operations. Speculative CAS operations
are different since they chew up the data bus, which has nominally
higher utilization.
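The arithmetic in the paragraph above can be checked with a short sketch. It assumes accesses are spread evenly over the banks and that the address pins run at half the data rate (DDR); the function and variable names are made up for illustration.

```python
def bank_utilization(bank_busy_cycles, slots_per_transaction, num_banks):
    """Average busy fraction per bank when the data bus is saturated.

    A saturated data bus starts one transaction every
    slots_per_transaction address cycles, and each transaction keeps
    one bank busy for bank_busy_cycles address cycles.
    """
    return bank_busy_cycles / (slots_per_transaction * num_banks)


line_bytes, channel_bits = 64, 64
data_beats = line_bytes * 8 // channel_bits   # data transfers per cache line
address_slots = data_beats // 2               # DDR: two data beats per address cycle
free_slots = address_slots - 2                # RAS and CAS use two of the slots
util = bank_utilization(10, address_slots, 4)
print(data_beats, address_slots, free_slots, util)
```

With a 10-cycle bank busy time and 4 banks this gives 10/16, the "little more than half busy" figure from the post.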
Some things are pretty clear:
* If you have a load miss, you'll want to get that RAS and CAS out
ahead of any writes to the same bank, unless the write buffer is
full.
Some things are not:
* When you have a free address bus slot coming up, if you fire a
speculative RAS, the benefit of lower latency if you are right
(and a load miss does need the data, maybe 3 address cycles
won) has to outweigh the loss if you are wrong (some other load
miss to that bank needs data, maybe 6-10 address cycles lost
because the bank has to be precharged).
* Speculating that the last RAS to a bank will also cover the next
access has to be given extra weight, since firing a different
RAS costs power.
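The speculative RAS trade-off in the list above can be put as a break-even probability. This is only a sketch using the rough cycle counts from the post (about 3 cycles won, 6-10 lost); it ignores the power cost mentioned in the last bullet, and the function names are hypothetical.

```python
def speculative_ras_gain(p_correct, cycles_won=3, cycles_lost=8):
    """Expected address cycles saved per speculative RAS (negative = net loss)."""
    return p_correct * cycles_won - (1 - p_correct) * cycles_lost


def break_even_probability(cycles_won=3, cycles_lost=8):
    """Prediction accuracy at which a speculative RAS neither helps nor hurts."""
    return cycles_lost / (cycles_won + cycles_lost)


for lost in (6, 10):
    p = break_even_probability(3, lost)
    print(f"cycles lost = {lost}: need accuracy above {p:.3f}")
```

With these numbers the predictor has to be right roughly two times out of three (6 cycles lost) to three times out of four (10 cycles lost) before speculation pays off on latency alone.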
Here's my final twist: My guess is that lots of business laptops
do not need 3D graphics, and would benefit more from the power
and cost savings of having graphics and CPU on the same die,
using the same DRAM. So with both devices hitting the same DRAMs,
and advance/speculative information about CPU accesses, are there
interesting strategies to get bandwidth to the GPU and low latency
to the CPU?
Intel, ATI, and Nvidia have of course seen this issue in their
integrated northbridge/graphics products, but they have not dealt
with it in the presence of speculative information from the CPU. |
|
|
 |
Ken Hagan
Guest
|
Posted:
Thu Nov 03, 2005 4:28 pm Post subject:
Re: Memory controller state of the art? |
|
|
Iain McClatchie wrote:
| Quote: |
Here's my final twist: My guess is that lots of business laptops
do not need 3D graphics, and would benefit more from the power
and cost savings of having graphics and CPU on the same die,
using the same DRAM. So with both devices hitting the same DRAMs,
and advance/speculative information about CPU accesses, are there
interesting strategies to get bandwidth to the GPU and low latency
to the CPU?
|
I'd have thought a GPU was almost entirely redundant for such laptops,
so we could just do all the display stuff in software, like we did 15
or 20 years ago. A GPU emulation ought to be highly parallel, so it
ought to be a good fit for N-way SMT. And when you *weren't* playing
on Google Earth, you could use the extra CPU power for the benefit of
whoever paid for the laptop. :) |
|
|
 |
Iain McClatchie
Guest
|
Posted:
Fri Nov 04, 2005 7:52 am Post subject:
Re: Memory controller state of the art? |
|
|
Ken> I'd have thought a GPU was almost entirely redundant for such laptops,
Ken> so we could just do all the display stuff in software, like we did 15
Ken> or 20 years ago.
Well:
- 2D compositing hardware, for things like antialiased text and
presentation graphics, is very, very small.
- Doing stuff in hardware takes a lot less power than doing it in
software.
So if you can afford the hardware (it's small enough to integrate,
and you will be using it often enough), throw it in.
Frankly, I thought the PA-semi presentation looked like the future.
Not exactly a laptop chip (CPU core too slow, no graphics, no RAMDAC),
but close.
Didn't AMD have a Geode thing that was basically what I'm talking
about, only not fast enough to be satisfying under MS Windows? |
|
|
 |
Niels Jørgen Kruse
Guest
|
Posted:
Fri Nov 04, 2005 3:48 pm Post subject:
Re: Memory controller state of the art? |
|
|
Iain McClatchie <iain-3@truecircuits.com> wrote:
| Quote: | I note that if the cache line is 64B and the DDR memory channel
is 64 or 72b, then you have 8 bit times on the data pins per
transaction, and 4 bit times on the address pins per transaction. A
read keeps the bank busy for many cycles, maybe 8-10.
RAS and CAS only take 2, so you can issue 2 more.
|
I thought RAS and CAS counted base clocks, not data rate clocks.
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark |
|
|
 |
David Wang
Guest
|
Posted:
Fri Nov 04, 2005 8:50 pm Post subject:
Re: Memory controller state of the art? |
|
|
MitchAlsup@aol.com wrote:
| Quote: | All of your musings in the first two paragraphs are implemented in the
Opteron On-die memory controllers.
But I think you are way overestimating the utility of FB-DIMM
technology. I see FB-DIMMs are a size play (4X) at the cost of latency
(+20%). Therefore those applications (like data bases) that respond
well to size increases at the cost of added latency will enjoy the
benefits of FB-DIMM. I am sure that business/productivity applications
will lose performance on FB-DIMM systems, and I believe that 3X as
many FP applications will lose performance as will gain performance
with FB-DIMM. And this comes from the viewpoint of having run these
applications on a very aggressive "memory and DRAM controller"
simulator.
|
Is there an alternative?
Suppose you stay with a conventional Opteron-like integrated
memory controller for DDR2 and then DDR3 devices.
You're going to lose half of your DRAM devices per channel, which
can be made up with higher density devices of that time frame, but
that means your max capacity will be stagnant relative to today's
max capacity.
Is that a good choice? Keep the low latency but no increases in
memory capacity for the foreseeable future?
I suppose that isn't the end of the world in terms of business
and productivity applications, but I wonder what will happen with
DDR4 devices as we go forward... Lose another half of your device
count? So accept the capacity limit for the foreseeable future?
I guess future processors will have to pick their poison: limit
the capacity or give up the latency.
What about going asymmetric?
Use GDDRx as your direct connect local memory like GPUs do,
and connect to FBDIMMs for the capacity. So CPUs can be sold
like GPU cards with DRAM tied to them, and "main memory" is farther
away through some FBD channels.
--
davewang202(at)yahoo(dot)com |
|
|
 |
Guest
|
Posted:
Sat Nov 05, 2005 1:15 am Post subject:
Re: Memory controller state of the art? |
|
|
MitchAlsup@aol.com wrote:
| Quote: | All of your musings in the first two paragraphs are implemented in the
Opteron On-die memory controllers.
|
Is there freely available documentation on such? I was not aware that
any processor implemented eager writeback or variable quality
prefetch operations. I am also surprised that Opteron's memory
controller would be so aggressive as to insert requests into the
MC pipeline before the L2 cache tags have been checked (it seems
like a lot of complexity for a few percent improvement in memory
latency and that only for node-local memory).
| Quote: | But I think you are way overestimating the utility of FB-DIMM
technology. I see FB-DIMMs are a size play (4X) at the cost of latency
(+20%). Therefore those applications (like data bases) that respond
well to size increases at the cost of added latency will enjoy the
benefits of FB-DIMM. I am sure that business/productivity applications
|
Capacity certainly has some desktop performance implications, and
this would be more the case if more aggressive prefetch from disk
was implemented in the OS. Partially starting up commonly used
applications before they are demanded could significantly improve
perceived responsiveness (and on a multi-core system the processing
requirements for such might not be problematic); similar benefits
might come from aggressive file prefetching. (FB-DIMM is slow, but
disk is even slower.)
(How much memory capacity does a desktop really want?)
I wish there was a way to get the pin bandwidth and capacity of
FB-DIMM while allowing for a relatively low latency portion of
memory. (David Wang's parallel memories might be workable,
but would seem to have a serious pin-count disadvantage. It
also assumes that page allocation to exploit the latency advantage
would not be a problem.) Making the first module especially
fast (and/or contain cached data from other modules) might be
beneficial, but one would still have the added latency of the AMB.
Anyway, I am not an FB-DIMM booster (I dislike the added latency
and cost); I simply figured it was the next step (especially given
the desire for a common server and desktop architecture).
| Quote: | will lose performance on FB-DIMM systems, and I believe that 3X as
many FP applications will lose performance as will gain performance
with FB-DIMM. And this comes from the viewpoint of having run these
applications on a very aggressive "memory and DRAM controller"
simulator.
|
I am surprised that FP applications would be negatively affected by
FB-DIMM. I had thought they were more bandwidth constrained than
latency constrained.
Paul A. Clayton |
|
|
 |
Guest
|
Posted:
Sat Nov 05, 2005 1:15 am Post subject:
Re: Memory controller state of the art? |
|
|
All of your musings in the first two paragraphs are implemented in the
Opteron On-die memory controllers.
But I think you are way overestimating the utility of FB-DIMM
technology. I see FB-DIMMs are a size play (4X) at the cost of latency
(+20%). Therefore those applications (like data bases) that respond
well to size increases at the cost of added latency will enjoy the
benefits of FB-DIMM. I am sure that business/productivity applications
will lose performance on FB-DIMM systems, and I believe that 3X as
many FP applications will lose performance as will gain performance
with FB-DIMM. And this comes from the viewpoint of having run these
applications on a very aggressive "memory and DRAM controller"
simulator. |
|
|
 |
Del Cecchi
Guest
|
Posted:
Sat Nov 05, 2005 9:15 am Post subject:
Re: Memory controller state of the art? |
|
|
<Dysthymicdolt@aol.com> wrote in message
news:1131145922.492341.306900@z14g2000cwz.googlegroups.com...
| Quote: | [MitchAlsup wrote:]
All of your musings in the first two paragraphs are implemented in
the Opteron On-die memory controllers.
[Paul A. Clayton wrote:]
Is there freely available documentation on such? I was not aware that
any processor implemented eager writeback or variable quality
prefetch operations. I am also surprised that Opteron's memory
controller would be so aggressive as to insert requests into the
MC pipeline before the L2 cache tags have been checked (it seems
like a lot of complexity for a few percent improvement in memory
latency and that only for node-local memory).
[MitchAlsup wrote:]
But I think you are way overestimating the utility of FB-DIMM
technology. I see FB-DIMMs are a size play (4X) at the cost of latency
(+20%). Therefore those applications (like data bases) that respond
well to size increases at the cost of added latency will enjoy the
benefits of FB-DIMM. I am sure that business/productivity applications
[Paul A. Clayton wrote:]
Capacity certainly has some desktop performance implications, and
this would be more the case if more aggressive prefetch from disk
was implemented in the OS. Partially starting up commonly used
applications before they are demanded could significantly improve
perceived responsiveness (and on a multi-core system the processing
requirements for such might not be problematic); similar benefits
might come from aggressive file prefetching. (FB-DIMM is slow, but
disk is even slower.)
(How much memory capacity does a desktop really want?)
I wish there was a way to get the pin bandwidth and capacity of
FB-DIMM while allowing for a relatively low latency portion of
memory. (David Wang's parallel memories might be workable,
but would seem to have a serious pin-count disadvantage. It
also assumes that page allocation to exploit the latency advantage
would not be a problem.) Making the first module especially
fast (and/or contain cached data from other modules) might be
beneficial, but one would still have the added latency of the AMB.
Anyway, I am not an FB-DIMM booster (I dislike the added latency
and cost); I simply figured it was the next step (especially given
the desire for a common server and desktop architecture).
[MitchAlsup wrote:]
will lose performance on FB-DIMM systems, and I believe that 3X as
many FP applications will lose performance as will gain performance
with FB-DIMM. And this comes from the viewpoint of having run these
applications on a very aggressive "memory and DRAM controller"
simulator.
[Paul A. Clayton wrote:]
I am surprised that FP applications would be negatively affected by
FB-DIMM. I had thought they were more bandwidth constrained than
latency constrained.
Paul A. Clayton
|
My guess is that the target market for FB-DIMM and associated processors
isn't the 150 dollar desktop where we are now. How much memory is
supported by the AMD processors? (I could go look it up but it's late
and I'm getting tired).
How does one attach many GB of memory to a processor without some
buffering scheme?
del |
|
|
 |
Guest
|
Posted:
Sun Nov 06, 2005 12:03 am Post subject:
Re: Memory controller state of the art? |
|
|
David Wang wrote:
[snip]
| Quote: | Use GDDRx as your direct connect local memory like GPUs do,
and connect to FBDIMMs for the capacity. So CPUs can be sold
like GPU cards with DRAM tied to them, and "main memory" is farther
away through some FBD channels.
|
Would the fast memory be treated as hardware cache (i.e.,
modest block sizes, hardware allocation and replacement, etc.)
or as fast main memory managed by the OS? Given parallel
memories, one might need to be careful about bandwidth use
since the fast memory might not have much more bandwidth
than the main memory. Also, given only a moderate latency
difference, tying up bandwidth for a swap (or even a copy)
might not be as broadly advantageous as with an on-chip cache.
ISTM that one really wants to be able to use FB-DIMM-like
pass-through to minimize pincount/maximize bandwidth, but
this would require more complex DRAMs with a noticeable
fraction of area used for control logic--higher price per bit
even if produced in high volume by multiple vendors. (Low-end
systems could get by with just one rank of conventional DRAM.)
I assume this is highly impractical at current transistor densities.
Paul A. Clayton |
|
|
 |
David Kanter
Guest
|
Posted:
Sun Nov 06, 2005 1:15 am Post subject:
Re: Memory controller state of the art? |
|
|
| Quote: | All of your musings in the first two paragraphs are implemented in the
Opteron On-die memory controllers.
But I think you are way overestimating the utility of FB-DIMM
technology. I see FB-DIMMs are a size play (4X) at the cost of latency
(+20%).
|
The latency cost is not that big. 10ns for an unloaded memory
subsystem, but it has latency advantages when moderately or heavily
loaded.
| Quote: | Therefore those applications (like data bases) that respond
well to size increases at the cost of added latency will enjoy the
benefits of FB-DIMM.
|
| Quote: | I am sure that business/productivity applications
will lose performance on FB-DIMM systems,
|
What do you mean by business/productivity apps? Like Excel?
| Quote: | and I believe that 3X as
many FP applications will lose performance as will gain performance
with FB-DIMM. And this comes from the viewpoint of having run these
applications on a very aggressive "memory and DRAM controller"
simulator.
|
Perhaps, I'm less willing to believe this though. Higher bandwidth is
always a win for FP stuff....
David |
|
|
 |
Iain McClatchie
Guest
|
Posted:
Sun Nov 06, 2005 9:15 am Post subject:
Re: Memory controller state of the art? |
|
|
Mitch> All of your musings in the first two paragraphs are implemented
Mitch> in the Opteron On-die memory controllers.
Each Opteron chip knows the page open status of the remote Opterons?
:)
I presume that accesses to remote memories are not started
speculatively. If that's true, does it mean we can measure the gain
from the speculative local memory operations by comparing runtimes
on local memory versus runtimes on remote memory, and
subtracting the additional remote latency * cache misses? |
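The measurement proposed above can be written as a one-line estimate. This sketch assumes every cache miss pays the full extra remote latency (no overlap between misses) and that remote accesses get no speculation at all. All numbers in the example are hypothetical, not measurements.

```python
def speculation_gain_estimate(t_local, t_remote, cache_misses, extra_remote_latency):
    """Estimate the runtime saved by speculative local memory operations.

    t_remote minus (misses * extra latency) approximates what the local
    run would take without speculation; the difference from the real
    local runtime is attributed to speculation.
    """
    local_without_speculation = t_remote - cache_misses * extra_remote_latency
    return local_without_speculation - t_local


# Hypothetical figures: runtimes in seconds, extra remote latency in seconds.
gain = speculation_gain_estimate(
    t_local=10.0,
    t_remote=12.5,
    cache_misses=20_000_000,
    extra_remote_latency=100e-9,
)
print(f"estimated gain from speculation: {gain:.3f} s")
```

Because real processors overlap misses, the product of miss count and extra latency overstates the remote penalty, so an estimate like this is probably a lower bound at best.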
|
|
 |
Guest
|
Posted:
Tue Nov 08, 2005 1:15 am Post subject:
Re: Memory controller state of the art? |
|
|
David Kanter wrote
[snip]:
| Quote: | [MitchAlsup wrote:]
and I believe that 3X as
many FP applications will lose performance as will gain performance
with FB-DIMM. And this comes from the viewpoint of having run these
applications on a very aggressive "memory and DRAM controller"
simulator.
Perhaps, I'm less willing to believe this though. Higher bandwidth is
always a win for FP stuff....
|
I suppose this also depends on the interleaving. If the channels
are page interleaved, common case unit block stride accesses
would not take advantage of all the FB-DIMM channels (until the
prefetcher reached the page boundary). Interleaving that allows
a single block to be fetched from all (e.g., 4) channels in parallel
would lead to a simpler memory controller, but might not be
practical with moderate block sizes (e.g., at 64B, each channel
of a 4 channel system would only handle 16B; to accomplish
this with DDR2 DRAMs would require a 32b/36b wide [or
narrower] rank since DDR2 requires a burst length of 4--and
there is some incentive for burst lengths to increase--; also the
tag overhead presumably becomes more significant).
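The two interleaving schemes in the paragraph above can be sketched as address-to-channel mappings, together with the rank-width arithmetic. The 4 KiB page size and all function names are assumptions for illustration; the 64B block, 4 channels, and burst length of 4 come from the post.

```python
BLOCK_BYTES = 64
PAGE_BYTES = 4096      # assumed DRAM page size, for illustration only
CHANNELS = 4


def channel_page_interleaved(addr):
    """Consecutive pages rotate across channels; a whole page stays on one."""
    return (addr // PAGE_BYTES) % CHANNELS


def channels_block_split(addr):
    """Every block is split across all channels, so each access uses them all."""
    return list(range(CHANNELS))


def rank_width_bits(block_bytes, channels, burst_length):
    """Data width a rank needs to deliver its share of a block in one burst."""
    bytes_per_channel = block_bytes // channels
    return bytes_per_channel * 8 // burst_length


# A unit-stride stream of blocks covering one page.
stream = range(0, PAGE_BYTES, BLOCK_BYTES)
touched = {channel_page_interleaved(a) for a in stream}
print("channels used with page interleaving:", sorted(touched))
print("channels used per block when split:", channels_block_split(0))
print("rank width (bits):", rank_width_bits(BLOCK_BYTES, CHANNELS, 4))
```

The page-interleaved stream stays on a single channel until the page boundary, while the split scheme needs a 32-bit data rank (36 bits with ECC) to fit a 16B share into a burst of 4.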
Presumably increasing the thread count per chip should make
FB-DIMM more attractive by increasing bandwidth demand and
distributing accesses more evenly among the channels. Mitch
might have run the simulations with one thread per memory node;
it seems unlikely that the simulations were for eight threads per
memory node.
Of course, it also depends on what one is keeping equal and what
values one uses: pincount, memory module count, number of
independent channels.
Paul A. Clayton
ignorant but somewhat willing to learn |
|
|
 |
David Wang
Guest
|
Posted:
Tue Nov 08, 2005 5:17 pm Post subject:
Re: Memory controller state of the art? |
|
|
Dysthymicdolt@aol.com wrote:
| Quote: | David Wang wrote:
[snip]
Use GDDRx as your direct connect local memory like GPUs do,
and connect to FBDIMMs for the capacity. So CPUs can be sold
like GPU cards with DRAM tied to them, and "main memory" is farther
away through some FBD channels.
Would the fast memory be treated as hardware cache (i.e.,
modest block sizes, hardware allocation and replacement, etc.)
or as fast main memory managed by the OS?
|
Perhaps something managed by the TLB, with large block/page sizes.
With something that's in the hundreds of MB and small blocks, I
don't know how you'd build a tag structure to manage such a cache.
| Quote: | Given parallel
memories, one might need to be careful about bandwidth use
since the fast memory might not have much more bandwidth
than the main memory. Also, given only a moderate latency
difference, tying up bandwidth for a swap (or even a copy)
might not be as broadly advantageous as with an on-chip cache.
|
Just another level of optimizations. :)
The multiple FBD channels would probably still have more BW,
and if your CPU is n-way CMP or multi-threaded, then feeding
directly from the FBD channels might not be too bad of an idea.
You'd only want to swap into the local memory if you're running
the legacy productivity software stuff that has been described
to be latency critical.
I almost think that we could keep the region below 4GB as local
memory and keep all the FBD stuff above 4GB. Or somehow treat
this memory as the 0th hop FBD memory. I need to go talk to some
OS guys and see what the implications are....
| Quote: | ISTM that one really wants to be able to use FB-DIMM-like
pass-through to minimize pincount/maximize bandwidth, but
this would require more complex DRAMs with a noticeable
fraction of area used for control logic--higher price per bit
even if produced in high volume by multiple vendors. (Low-end
systems could get by with just one rank of conventional DRAM.)
I assume this is highly impractical at current transistor densities.
|
I *think* that proposing a new DRAM device just to get this done
is a non-starter. It's hard to predict: you might battle to get
close to price parity with the commodity DRAM stuff, then the bottom
drops out of the DRAM market (again), and you find yourself with
a 5X multiple in price against the commodity stuff (again). :)
--
davewang202(at)yahoo(dot)com |
|
|
 |
David Kanter
Guest
|
Posted:
Wed Nov 09, 2005 1:15 am Post subject:
Re: Memory controller state of the art? |
|
|
| Quote: | Just another level of optimizations. :)
The multiple FBD channels would probably still have more BW,
and if your CPU is n-way CMP or multi-threaded, then feeding
directly from the FBD channels might not be too bad of an idea.
You'd only want to swap into the local memory if you're running
the legacy productivity software stuff that has been described
to be latency critical.
I almost think that we could keep the region below 4GB as local
memory and keep all the FBD stuff above 4GB. Or somehow treat
this memory as the 0th hop FBD memory. I need to go talk to some
OS guys and see what the implications are....
|
It seems there are some FBD controller optimizations that could be
made.
1. Variable latency access. Let each DIMM communicate as fast as it
can, rather than at the speed of the slowest DIMM.
2. Using 1, pin all your latency sensitive data to DIMM 0.
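A toy model shows what the two optimizations above would buy. It assumes each FB-DIMM hop adds a fixed pass-through delay on top of a base access latency; the 50 ns and 4 ns figures and the function names are hypothetical placeholders, not measured or specified values.

```python
def dimm_latency_ns(hop, base_ns=50.0, per_hop_ns=4.0):
    """Latency of the DIMM at position `hop` (0 = nearest the controller)."""
    return base_ns + hop * per_hop_ns


def fixed_mode_latency_ns(num_dimms):
    """Fixed-latency mode: every access waits as long as the farthest DIMM."""
    return dimm_latency_ns(num_dimms - 1)


def variable_mode_latency_ns(hop):
    """Variable-latency mode: each DIMM answers as fast as it can."""
    return dimm_latency_ns(hop)


num_dimms = 8
fixed = fixed_mode_latency_ns(num_dimms)
nearest = variable_mode_latency_ns(0)
print(f"fixed mode, any DIMM: {fixed} ns")
print(f"variable mode, DIMM 0: {nearest} ns")
```

Under these made-up numbers, data pinned to DIMM 0 would see the base latency instead of the worst-case latency of the whole channel, which is the point of combining idea 1 with idea 2.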
I'm sure that smarter folks out there have more ideas to add to this
(or have already thought these up), but it can't hurt to mention them.
David |
|
|
 |