Future memory modules
CASTalk.com Forum Index -> Computer Architecture
Guest






Posted: Tue Nov 01, 2005 9:15 am    Post subject: Future memory modules

With the adoption of an FB-DIMM-like architecture (to maximize
capacity, to minimize memory controller pin count, and to
simplify board design), there is some new opportunity for certain
bandwidth and latency optimizations, but there is also a
decreased benefit from latency optimizations (with the added
latency of more chip crossings).

Bandwidth optimizations could include finer-grained writes than
the underlying DRAMs support, let alone the basic cache line
size (the buffer chip could perform a read-and-update exploiting
the larger bandwidth and lower latency between buffer chip and
the DRAMs; similarly, if a cache line is distributed among two or
more channels, only one channel might need to be used to perform
an update; one could also support read accesses smaller than
the memory system natively supports), on-module addition
(especially for streaming accumulation), on-module search
(including min/max), on-module and module-to-module copy/swap
(this could include module-to-I/O device copy/swap if the memory
channels included or were terminated by an I/O connection chip,
allowing DMA to be handled with minimal interaction with the
memory controller) and on-module prefetch cache (this could allow
more aggressive DRAM page closing and opening, reducing
dead-time). (On-module scanning for correctable memory errors
could improve available bandwidth for a given degree of
reliability.)
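
A minimal sketch of the read-and-update idea for finer-grained writes, assuming a toy model in which the DRAMs behind the buffer chip accept only whole aligned 64-byte bursts. The class name `BufferChip`, the constant `BURST_BYTES` and the counters are illustrative assumptions, not part of any FB-DIMM specification. Only the small payload crosses the channel; the merge happens on the module.

```python
BURST_BYTES = 64  # assumed minimum DRAM write granularity (illustrative)


class BufferChip:
    """Toy model: the DRAM behind the buffer accepts only whole aligned bursts."""

    def __init__(self, size_bytes):
        self.dram = bytearray(size_bytes)
        self.dram_reads = 0
        self.dram_writes = 0

    def _read_burst(self, base):
        self.dram_reads += 1
        return bytearray(self.dram[base:base + BURST_BYTES])

    def _write_burst(self, base, data):
        assert base % BURST_BYTES == 0 and len(data) == BURST_BYTES
        self.dram_writes += 1
        self.dram[base:base + BURST_BYTES] = data

    def partial_write(self, addr, payload):
        """Write fewer bytes than a burst; returns bytes sent over the channel."""
        base = addr - addr % BURST_BYTES
        offset = addr - base
        assert offset + len(payload) <= BURST_BYTES, "must not cross a burst"
        burst = self._read_burst(base)                 # local read on the module
        burst[offset:offset + len(payload)] = payload  # merge the new bytes
        self._write_burst(base, burst)                 # local write-back
        return len(payload)


chip = BufferChip(256)
sent = chip.partial_write(70, b"\xaa\xbb")
```

In this model a 2-byte update costs 2 bytes of channel data plus one read and one write on the (assumed faster) buffer-to-DRAM side, instead of a full 64-byte transfer in each direction over the channel.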

While the added latency of passing through one or more buffer
chips may discourage the adoption of lower-latency DRAMs, the
added cost from the buffer chip may hide some of the extra cost
of lower-latency DRAMs. (Reduced latency and increased bandwidth
might be more reasonable for the first module. Aside from using
increased bandwidth to support on-module prefetch, it could be
used to support normal memory activity while swapping pages
between the first and more distant modules. Unfortunately,
outside of a small fraction of software that isolates frequently
used or latency-sensitive data into special pages, perhaps using
slab allocation [and this would require additional OS support,
e.g., a MAP_FASTMEM flag in mmap, for application use],
lower-latency modules might only be used by OS page placement
management [e.g., initially loading pages prefetched from disk to more
distant modules, perhaps migrating recently or frequently
accessed pages {or even pages that generate more demand
last-level cache misses} nearer the memory controller
{interestingly, the second module of an adjacent memory
controller might have lower latency than the eighth module of the
local memory controller, though likely lower bandwidth}].)

It is not clear that the physical abstraction layer and the
protocol layer are sufficient to encourage evolution (or
diversity) in module design. While the protocol layer can evolve
independently from the DRAMs (whose manufacturers are generally
focused on price/bit) and motherboards (whose manufacturers might
also be somewhat conservative), it is currently strongly
dependent on the memory controller. (On the positive side, it
should be possible to implement aligned burst hardware
optimizations without requiring a memory controller that only
issues aligned burst accesses; the buffer chip could split an
unaligned access into two aligned accesses. With a relatively
narrow data path to the controller [14 bits wide], a relatively
large [e.g., 64B + 8B ECC] aligned block read data return could
be initiated early by locating the first portion of the block in
faster memory in the DRAMs, in a special fast DRAM, or even on
the buffer chip. Unfortunately, the latency advantage would
probably be minimal--a few percent of the total latency--and the
cost of even one-eighth of the module using a special DRAM would
probably be significant. One could use the same access
pipelining technique to place part of a cache on the buffer chip
and part on a separate fast DRAM; however, it is not clear how
useful on-module caching would be, though at least it could be
abstracted from the memory controller.)
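
The splitting of an unaligned access into aligned accesses could look roughly like the sketch below. The function name `split_unaligned` and the 64-byte default block size are assumptions for illustration; a real buffer chip would do this in hardware on the command stream.

```python
def split_unaligned(addr, length, block=64):
    """Split [addr, addr + length) into pieces that each stay inside one
    aligned block. Each piece is (block_base, offset_in_block, nbytes)."""
    if length <= 0:
        raise ValueError("length must be positive")
    pieces = []
    end = addr + length
    while addr < end:
        base = addr - addr % block
        nbytes = min(end, base + block) - addr
        pieces.append((base, addr - base, nbytes))
        addr += nbytes
    return pieces


unaligned = split_unaligned(96, 64)   # straddles the blocks at 64 and 128
aligned = split_unaligned(128, 64)    # already aligned, stays one access
```

A block-sized access that starts mid-block becomes exactly two aligned accesses, so the memory controller does not need to know that the DRAM side only handles aligned bursts.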

(On-DRAM row or row-section caching could be a significant
benefit since on-chip bandwidth is relatively inexpensive and the
buffer chip could hold the tags [so the memory controller could
be ignorant of which rows are cached, though an integrated memory
controller might have more information on which rows should be
cached]. Simple EDRAM [1 cached row] would allow early
low-penalty precharging, simplifying the memory controller. Even
with an integrated memory controller [minimum latency to memory]
and on-chip L3 [maximum difference between L3 hit latency and
memory access and constrained L3 capacity], it is not clear that
a lower-latency/higher-bandwidth module would be worth even a
small [20%] price premium given the likely minimal impact on
performance.)


Paul A. Clayton
sorry for the diarrhea of the keyboard :-(
JJ
Guest





Posted: Tue Nov 01, 2005 5:01 pm    Post subject: Re: Future memory modules

Dysthymicdolt@aol.com wrote:
Quote:
With the adoption of an FB-DIMM-like architecture (to maximize
capacity, to minimize memory controller pin count, and to
simplify board design), there is some new opportunity for certain
bandwidth and latency optimizations, but there is also a
decreased benefit from latency optimizations (with the added
latency of more chip crossings).


snipping

Take a look at Micron RLDRAM2, 20ns latency for each of 8 banks all in
flight, with 2.5ns issue rate almost full random over entire address
space ignoring bank collisions. Parts at 256M, 576M about 2x SDRAM $.
Bitline only has 32 bits per line to get the much lower latency. v3
promises 15ns bank latency and 1.9ns issue rate (8 clocks still). Can
use 400MHz DDR for bursts. They use wide unmuxed pipelined SRAM-like
interfaces. Maybe not main memory, perhaps L3 or even L2. Now how to
hide 8 clocks of latency?
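
As a rough check on those figures: 20ns latency at a 2.5ns issue rate means 8 accesses must be in flight to keep the part busy, one per bank. The toy timing model below is only a sketch. It assumes each bank is busy for the full 20ns after it starts an access (the post gives the latency, not a separate bank cycle time), and the function name `finish_time_ns` is made up for illustration.

```python
def finish_time_ns(bank_sequence, bank_busy_ns=20.0, issue_ns=2.5):
    """Time at which the last command has been issued, given that a bank
    cannot start a new access until bank_busy_ns after its previous one
    and the interface issues at most one command per issue_ns."""
    bank_free = {}
    now = 0.0
    for bank in bank_sequence:
        start = max(now, bank_free.get(bank, 0.0))
        bank_free[bank] = start + bank_busy_ns
        now = start + issue_ns
    return now


in_flight = 20.0 / 2.5                                   # accesses needed to hide latency
interleaved = finish_time_ns([i % 8 for i in range(64)])  # round-robin over 8 banks
colliding = finish_time_ns([0] * 64)                      # every access hits bank 0
```

With perfect interleaving the 64 accesses issue back to back; with every access colliding on one bank the same work takes roughly eight times as long, which is why the interleaving of the access stream matters so much here.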

not affiliated with these people

John Jakson
Guest






Posted: Thu Nov 03, 2005 8:33 am    Post subject: Re: Future memory modules

JJ wrote:
[snip]
Quote:
Take a look at Micron RLDRAM2, 20ns latency for each of 8 banks all in
flight, with 2.5ns issue rate almost full random over entire address
space ignoring bank collisions. Parts at 256M, 576M about 2x SDRAM $.
Bitline only has 32 bits per line to get the much lower latency. v3
promises 15ns bank latency and 1.9ns issue rate (8 clocks still). Can
use 400MHz DDR for bursts. They use wide unmuxed pipelined SRAM-like
interfaces. Maybe not main memory, perhaps L3 or even L2. Now how to
hide 8 clocks of latency?

RLDRAM2 may be suitable to networking applications, but it
seems to be poorly designed for an L3 memory. While the
latency might not be bad for such a potentially large cache
(e.g., 256 MiB using 4 chips), the design is not specifically
optimized for such usage (e.g., burst lengths of 2 and 4 [and
8 for 9b/18b wide chips] where a single large burst length
could have allowed even lower latency [at least the bursts do
seem to be required to be aligned], the multiplexed address
version does not allow for reduced latency [where an L3 could
speculate on a hit {or dirty line replacement or victim buffer
allocation} and send all but the last few {way selection}
address bits early]) and the need to explicitly refresh memory
might be a problem. The bandwidth might be acceptable even
given a moderate 72b wide data path.

Such a huge cache presents other problems, though. For anything
but a high-end server processor in which all configurations
include such off-chip cache, the tags would have to be
off-chip which would significantly increase latency for a
non-direct-mapped cache (one might be willing to place on-chip
a single way prediction bit per 512B of cache--a 64KiB
predictor array might not be excessively expensive). (I
suspect that high-end servers would pay the premium for 1T SRAM.)
RLDRAM tag memory might be problematic in terms of providing
enough bandwidth at a small capacity (perhaps one might have
a fifth 16b- or 32b-wide chip contain tags and ECC [though
there is not a 512MiB RLDRAM], one could perform way prediction
and early miss detection with partial tags).

ISTM that RLDRAM might be more attractive as a parallel, fast
memory (software-managed cache). Unfortunately, OSes and
applications are not designed to exploit such.

(It might be desirable to stripe blocks across 7 of the 8 banks
to maximize throughput for sequential accesses and reduce bank
conflict [MOD 7 is not that slow to compute]. [This could also
reduce conflict misses in a cache.] Of course, there is then
the problem of what to do with the eighth bank.)
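
A small sketch of why striping over 7 banks helps: with 8 banks, any power-of-two stride keeps landing on the same few banks, while with 7 banks the same strides visit every bank. The helper names `bank_and_row` and `banks_touched` are illustrative assumptions.

```python
def bank_and_row(block_index, banks=7):
    """Map a block index to (bank, row within bank) using modulo striping."""
    return block_index % banks, block_index // banks


def banks_touched(stride_blocks, accesses, banks):
    """Number of distinct banks hit by a strided sequence of block accesses."""
    return len({(i * stride_blocks) % banks for i in range(accesses)})


stride8_on_8 = banks_touched(8, 56, 8)   # every access hits the same bank
stride8_on_7 = banks_touched(8, 56, 7)   # spread over all seven banks
stride4_on_8 = banks_touched(4, 56, 8)   # only two banks used
stride7_on_7 = banks_touched(7, 56, 7)   # the one bad stride for 7 banks
```

The pathological stride moves from the common powers of two to multiples of 7, which software rarely uses.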


Paul A. Clayton
JJ
Guest





Posted: Thu Nov 03, 2005 5:15 pm    Post subject: Re: Future memory modules

Dysthymicdolt@aol.com wrote:
Quote:
JJ wrote:
[snip]
Take a look at Micron RLDRAM2, 20ns latency for each of 8 banks all in
flight, with 2.5ns issue rate almost full random over entire address
space ignoring bank collisions. Parts at 256M, 576M about 2x SDRAM $.
Bitline only has 32 bits per line to get the much lower latency. v3
promises 15ns bank latency and 1.9ns issue rate (8 clocks still). Can
use 400MHz DDR for bursts. They use wide unmuxed pipelined SRAM-like
interfaces. Maybe not main memory, perhaps L3 or even L2. Now how to
hide 8 clocks of latency?

RLDRAM2 may be suitable to networking applications, but it
seems to be poorly designed for an L3 memory. While the

Of course, many networking applications are highly interleaved, so the
latency can be pretty well hidden and issue rate exploited, same with
some processor designs. RL has fast-large doodad written all over it,
just work with it some. Shame about the banking, it should have been
several times finer. Now if it had been 64K banks or so, the design
would effectively look like an 8 stage pipelined SRAM with full 2.5ns
issue rate.

Quote:
latency might not be bad for such a potentially large cache
(e.g., 256 MiB using 4 chips), the design is not specifically
optimized for such usage (e.g., burst lengths of 2 and 4 [and
8 for 9b/18b wide chips] where a single large burst length
could have allowed even lower latency [at least the bursts do

RLDRAMs are best used for their interleaved flat memory model with high
issue rates, not for bursty applications. If you only want bursty,
DDR3 or RDRAM might be better.

Quote:
seem to be required to be aligned], the multiplexed address
version does not allow for reduced latency [where an L3 could
speculate on a hit {or dirty line replacement or victim buffer
allocation} and send all but the last few {way selection}
address bits early]) and the need to explicitly refresh memory
might be a problem. The bandwidth might be acceptable even
given a moderate 72b wide data path.

Such a huge cache presents other problems, though. For anything
but a high-end server processor in which all configurations
include such off-chip cache, the tags would have to be
off-chip which would significantly increase latency for a

Tags add another 20% or more to width, but you are stuck in
conventional cache architecture.

Quote:
non-direct-mapped cache (one might be willing to place on-chip
a single way prediction bit per 512B of cache--a 64KiB
predictor array might not be excessively expensive). (I
suspect that high-end servers would pay the premium for 1T SRAM.)
RLDRAM tag memory might be problematic in terms of providing
enough bandwidth at a small capacity (perhaps one might have
a fifth 16b- or 32b-wide chip contain tags and ECC [though
there is not a 512MiB RLDRAM], one could perform way prediction
and early miss detection with partial tags).

ISTM that RLDRAM might be more attractive as a parallel, fast
memory (software-managed cache). Unfortunately, OSes and
applications are not designed to exploit such.


No, we don't really want that.

Quote:
(It might be desirable to stripe blocks across 7 of the 8 banks
to maximize throughput for sequential accesses and reduce bank
conflict [MOD 7 is not that slow to compute]. [This could also
reduce conflict misses in a cache.] Of course, there is then
the problem of what to do with the eighth bank.)



Nor that.

Quote:
Paul A. Clayton

John
Guest






Posted: Fri Nov 04, 2005 1:15 am    Post subject: Re: Future memory modules

JJ wrote:
Quote:
Dysthymicdolt@aol.com wrote:
RLDRAM2 may be suitable to networking applications, but it
seems to be poorly designed for an L3 memory. While the

Of course, many networking applications are highly interleaved, so the
latency can be pretty well hidden and issue rate exploited, same with
some processor designs. RL has fast-large doodad written all over it,

Are you thinking heavily multithreaded designs like Sun's Niagara
(8 4-way SMT scalar SPARC cores)?

Quote:
just work with it some. Shame about the banking, it should have been
several times finer. Now if it had been 64K banks or so, the design
would effectively look like an 8 stage pipelined SRAM with full 2.5ns
issue rate.

Of course, such would have reduced density/increased cost
per bit. Anyone care to guess by how much?

Quote:
latency might not be bad for such a potentially large cache
(e.g., 256 MiB using 4 chips), the design is not specifically
optimized for such usage (e.g., burst lengths of 2 and 4 [and
8 for 9b/18b wide chips] where a single large burst length
could have allowed even lower latency [at least the bursts do

RLDRAMs are best used for their interleaved flat memory model with high
issue rates, not for bursty applications. If you only want bursty,
DDR3 or RDRAM might be better.

I was thinking that a larger burst length would allow the chip design
to reduce latency further--for a burst length of 8, one would only
need one eighth of the bits to be accessible in, say, two cycles, the
second eighth in three cycles, etc. It would be difficult to make
half of the bits (burst length of 2) sufficiently faster to significantly
reduce latency. For an L3, a smallish 64B block size would allow
for a 64b wide interface and bursts of 8 (one might provide two 64b
wide ranks to increase bandwidth--as long as even and odd block
accesses are well distributed).

Quote:
Such a huge cache presents other problems, though. For anything
but a high-end server processor in which all configurations
include such off-chip cache, the tags would have to be
off-chip which would significantly increase latency for a

Tags add another 20% or more to width, but you are stuck in
conventional cache architecture.

Of course, the additional width required depends on the design.
A direct-mapped cache could get by easily with less than a 12%
increase in width (sharing tag with ECC would allow a 16b per half
cycle reading of tags for the first half of a burst of 8 while the
second half read the ECC--an 80b wide interface would support one 64b
tag per 64B cache block) with parallel read of tags and data. A more
associative cache with sequential tag then data access would
add significant latency even with early way selection based on
partial tag comparison. (One advantage of a relatively large tag
memory could be the ability to include some additional information
such as previous allocated blocks or a next-fetch prediction.)
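
The width arithmetic for the direct-mapped case can be checked with a short sketch. It assumes a burst of 8 transfers per 64B block and reads the "less than 12%" figure as relative to a conventional 72b (64b data + 8b ECC) interface; that reading and the function name `shared_tag_ecc_layout` are assumptions, not something stated outright above.

```python
def shared_tag_ecc_layout(data_bits=64, bus_bits=80, burst=8, ecc_bus_bits=72):
    """Bits available when the extra bus lanes carry the tag during the first
    half of the burst and the ECC during the second half."""
    side_bits = bus_bits - data_bits   # extra lanes per transfer
    tag_beats = burst // 2             # first half of the burst carries tag
    return {
        "block_bytes": data_bits * burst // 8,
        "tag_bits": side_bits * tag_beats,
        "ecc_bits": side_bits * (burst - tag_beats),
        "width_increase": (bus_bits - ecc_bus_bits) / ecc_bus_bits,
    }


layout = shared_tag_ecc_layout()
```

Under these assumptions the 16 extra lanes give a 64b tag and 64b of ECC per 64B block, the same ECC budget as a 72b interface over 8 transfers, for about an 11% wider bus.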

Quote:
ISTM that RLDRAM might be more attractive as a parallel, fast
memory (software-managed cache). Unfortunately, OSes and
applications are not designed to exploit such.

No, we don't really want that.

What I mean is a page-size block, fully associative, non-redundant
(i.e., data swapping not data copying) 'cache' so that TLB entries
are effectively the tags. What is so horrible about that? (Obviously
the allocation policy would have to be reasonably smart since a
4KiB page swap would use 1280ns of the fast memory interface
with a similar utilization of the main memory interface. However, I
would guess that even a simple allocation with eviction only on
page unmapping might be able to boost performance enough to
justify the added cost.)
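
The 1280ns figure is consistent with a 64b data path at 400MHz DDR, counting one 4KiB read out of the fast memory plus one 4KiB write into it. The sketch below reproduces that arithmetic; the interface parameters are my reading of the numbers quoted earlier in the thread, and the function name `swap_time_ns` is illustrative.

```python
def swap_time_ns(page_bytes=4096, data_bits=64, clock_ns=2.5, transfers_per_clock=2):
    """Interface time to swap one page: read the old page out and write the
    new page in, ignoring command overhead, refresh and bank conflicts."""
    bytes_per_clock = data_bits // 8 * transfers_per_clock
    clocks = 2 * page_bytes // bytes_per_clock   # page out + page in
    return clocks * clock_ns


swap_ns = swap_time_ns()
swaps_per_ms = 1_000_000 / swap_ns   # upper bound if the interface did nothing else
```

At 1280ns per swap the interface could sustain at most about 780 page swaps per millisecond with no bandwidth left for ordinary accesses, which is why the allocation policy has to be selective.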

Quote:
(It might be desirable to stripe blocks across 7 of the 8 banks
to maximize throughput for sequential accesses and reduce bank
conflict [MOD 7 is not that slow to compute]. [This could also
reduce conflict misses in a cache.] Of course, there is then
the problem of what to do with the eighth bank.)

Nor that.

What is wrong with prime modulo bank striping? Adding, say,
500ps (or less?) to the access latency should not be a big issue.
POWER4/5 use a 3-way L2 cache banking presumably to
reduce bank conflicts (taking the modulus of a 'large number' of
the physical address bits).
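
One reason a modulus of 7 is cheap: since 8 is congruent to 1 modulo 7, a number is congruent modulo 7 to the sum of its base-8 digits, so the bank index can be formed by adding 3-bit fields of the address. The sketch below shows that folding in software; it is a well-known trick, not a claim about how POWER4/5 or any RLDRAM controller actually computes its bank index.

```python
def mod7(x):
    """x mod 7 for non-negative x, using only shifts, masks and adds."""
    while x > 7:
        total = 0
        while x:
            total += x & 7   # add the next base-8 digit
            x >>= 3
        x = total            # the digit sum is congruent to x modulo 7
    return 0 if x == 7 else x


bank = mod7(0x12345)         # e.g. fold a block address into a bank index
```

In hardware the digit sums would be a small adder tree over 3-bit groups of the address, which is where a latency estimate on the order of a few hundred picoseconds would come from.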


Paul A. Clayton
Sander Vesik
Guest





Posted: Fri Nov 04, 2005 9:17 pm    Post subject: Re: Future memory modules

Dysthymicdolt@aol.com wrote:
Quote:
optimized for such usage (e.g., burst lengths of 2 and 4 [and
8 for 9b/18b wide chips] where a single large burst length
could have allowed even lower latency [at least the bursts do
seem to be required to be aligned], the multiplexed address
version does not allow for reduced latency [where an L3 could
speculate on a hit {or dirty line replacement or victim buffer
allocation} and send all but the last few {way selection}
address bits early]) and the need to explicitly refresh memory
might be a problem. The bandwidth might be acceptable even
given a moderate 72b wide data path.

Instead of sending the bits last, just have a longer burst starting
at the row predicted to be most likely and then discard the
unneeded data. It also allows you to do the tag lookup in parallel
with fetching the data and "merely" wastes power in case of misses ;-)

--
Sander

+++ Out of cheese error +++
Guest






Posted: Sat Nov 05, 2005 1:15 am    Post subject: Re: Future memory modules

Sander Vesik wrote:

Quote:
Instead of sending the bits last, just have a longer burst starting
at the row predicted to be most likely and then discard the
unneeded data. It also allows you to do the tag lookup in parallel
with fetching the data and "merely" wastes power in case of misses ;-)

The problem is that the tag latency is considerable; one would
probably be able to read two blocks before one could get the tag
checked and the address for the appropriate way (if it was not
one of the first two selected)--if SRAM was used for the tags, one
could probably get by with a single block read but SRAM tags
would be more expensive. However, a 256MiB cache with 64B
blocks would require 256KiB of predictor data for 2-way
associativity; even 64KiB of predictor data (256B blocks or
predictor bit sharing) might not be small enough to justify unless
either a large fraction of the chips were sold with the cache or a
significant premium was charged for chips with a predictor. I do
not know how well a single 4b predictor per set would work for
a 16-way associative cache; MRU prediction even in such a large
cache might not work well with such high associativity.
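
The predictor sizes above follow directly from the number of sets. A short sketch of the arithmetic, with `predictor_bytes` as an illustrative helper name:

```python
KiB = 1 << 10
MiB = 1 << 20


def predictor_bytes(cache_bytes, block_bytes, ways, bits_per_set):
    """On-chip storage for a way predictor holding bits_per_set bits per set."""
    sets = cache_bytes // block_bytes // ways
    return sets * bits_per_set // 8


two_way_64b = predictor_bytes(256 * MiB, 64, 2, 1)      # 1 bit picks 1 of 2 ways
two_way_256b = predictor_bytes(256 * MiB, 256, 2, 1)    # larger blocks, fewer sets
sixteen_way = predictor_bytes(256 * MiB, 64, 16, 4)     # 4 bits pick 1 of 16 ways
```

The 16-way case needs fewer sets but four bits each, so by this count it would take 128KiB, between the two 2-way figures.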

If the tag check took longer than a single block's fetch time, one
could presumably issue a primary prediction fetch from another
data request (under high utilization) or issue a secondary
prediction fetch (though this would require more predictor data).

Unfortunately, even SRAM off-chip tags increase the latency of
memory access in the case of a miss (of course, misses should
be relatively rare with such an enormous cache).


Paul A. Clayton
just a technophile
All times are GMT