Off-chip cache considerations?
CASTalk.com Forum Index CASTalk.com
Discussion of DSP, FPGA, storage and embedded system.
 
Guest






Posted: Sat Oct 29, 2005 9:22 pm    Post subject: Off-chip cache considerations?

What are the trade-offs that architects consider when deciding
whether to support an off-chip cache in an implementation?

Obviously, off-chip cache allows a capacity which would be
impractical for on-chip cache; however, the latency penalty of
going off-chip is significant, especially when combined with an
on-chip memory controller. For larger multiple node systems, the
latency of remote accesses tends to increase the benefit of even
an off-chip cache and the reduction in coherence traffic
(especially with snoopy coherence) and memory traffic (especially
multiple hop traffic) can make a large cache more attractive.
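
One rough way to frame the capacity-versus-latency trade-off above is average memory access time (AMAT). The Python sketch below is only an illustration: the function name `amat` and every latency and miss-rate figure in it are invented assumptions, not measurements of any real processor.

```python
def amat(levels, memory_latency_ns):
    """Average memory access time for a chain of cache levels.

    levels: list of (hit_latency_ns, local_miss_rate), ordered from the
    level closest to the core outwards. All values are hypothetical.
    """
    total = 0.0
    reach_probability = 1.0  # fraction of accesses that get this far
    for hit_latency_ns, miss_rate in levels:
        total += reach_probability * hit_latency_ns
        reach_probability *= miss_rate
    return total + reach_probability * memory_latency_ns


# Hypothetical on-chip hierarchy: L1 then L2.
on_chip_only = [(1.0, 0.05), (5.0, 0.40)]
# Same hierarchy plus a hypothetical large, slow off-chip L3.
with_off_chip = on_chip_only + [(40.0, 0.50)]

local_memory_ns = 80.0    # assumed: local DRAM behind an on-chip controller
remote_memory_ns = 300.0  # assumed: multi-hop remote access

for name, mem in (("local", local_memory_ns), ("remote", remote_memory_ns)):
    print(name, amat(on_chip_only, mem), amat(with_off_chip, mem))
```

With these invented numbers the off-chip level only breaks even against local memory but wins clearly against remote memory, which is the shape of the argument about multi-node systems. Different assumptions can easily reverse the local case.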

An off-chip cache can also take advantage of memory components
manufactured in higher volumes than the processor (obviously a
consideration for high-end server processors but also perhaps for
embedded processors). OTOH, one then loses the ability to
optimize the memory components for the specific implementation
(e.g., the memory component can be designed to talk with like
memory components of other processors [exploiting the additional
pin potential from an additional chip] or eDRAM accesses can be
pipelined by separate transmission of row and column addresses).

Supporting an off-chip cache also allows a single chip to be
packaged and sold in a less expensive version without the
off-chip cache or with a smaller off-chip cache (although such
differentiation is possible with on-chip cache, if the count of
partially defective chips is insufficient to meet demand, some
die area will be wasted).

On the negative side, an off-chip cache requires expensive
off-chip bandwidth and increases manufacturing complexity and
cost. (Because of the likely tighter integration of the cache
with the processor [relative to memory and other processors] the
bandwidth is less expensive [shorter wires --> faster clock].)


Paul A. Clayton
Bill Todd
Guest





Posted: Sun Oct 30, 2005 12:15 am    Post subject: Re: Off-chip cache considerations?

Dysthymicdolt@aol.com wrote:
Quote:
What are the trade-offs that architects consider when deciding
whether to support an off-chip cache in an implementation?

Obviously, off-chip cache allows a capacity which would be
impractical for on-chip cache; however, the latency penalty of
going off-chip is significant, especially when combined with an
on-chip memory controller.

Indeed. POWER4/4+/5 systems have exhibited this characteristic with
their L3 cache for many years, and IIRC PA-RISC systems as well (for
their L1 caches, no less). With POWER4/4+ I think the off-chip cache
support did double duty by eliminating the need for a separate memory
controller, but of course POWER5 has changed that.

Quote:
For larger multiple node systems, the
latency of remote accesses tends to increase the benefit of even
an off-chip cache and the reduction in coherence traffic
(especially with snoopy coherence) and memory traffic (especially
multiple hop traffic) can make a large cache more attractive.

That certainly seems to have been the rationale behind the POWER
implementations above, though it's difficult to understand how having a
humongous but slow L1 was a win for PA-RISC in anything save size and
cost (eDRAM is far denser than SRAM, but less compatible with the
processes used to implement the rest of the core).

IBM used off-chip L4 shared eDRAM in their multi-quad-Xeons as well
until X3 came along with its 'virtual cache' (I've never known whether
the off-chip POWER cache is eDRAM, but they don't seem to charge a great
deal for the chips so it may be - and the latency quoted is over twice
what they could potentially have achieved with off-chip SRAM, which at
least on Alpha had sub-20-ns. latencies). IIRC Horus does something
similar with Opterons (though I don't recall whether it's eDRAM or cache
directly on the controller chip).

In other words, you've pretty well covered the territory that I'm
familiar with, so I'll be curious too to see what else may pop up.

- bill
David Kanter
Guest





Posted: Mon Oct 31, 2005 1:12 am    Post subject: Re: Off-chip cache considerations?

Bill Todd wrote:
Quote:
Dysthymicdolt@aol.com wrote:
What are the trade-offs that architects consider when deciding
whether to support an off-chip cache in an implementation?

Obviously, off-chip cache allows a capacity which would be
impractical for on-chip cache; however, the latency penalty of
going off-chip is significant, especially when combined with an
on-chip memory controller.

Indeed. POWER4/4+/5 systems have exhibited this characteristic with
their L3 cache for many years, and IIRC PA-RISC systems as well (for
their L1 caches, no less). With POWER4/4+ I think the off-chip cache
support did double duty by eliminating the need for a separate memory
controller, but of course POWER5 has changed that.

The POWER4/4+ indeed had problems with their L3, but I believe that the
cache on the POWER5 is far more effective.

Quote:
For larger multiple node systems, the
latency of remote accesses tends to increase the benefit of even
an off-chip cache and the reduction in coherence traffic
(especially with snoopy coherence) and memory traffic (especially
multiple hop traffic) can make a large cache more attractive.

That certainly seems to have been the rationale behind the POWER
implementations above, though it's difficult to understand how having a
humongous but slow L1 was a win for PA-RISC in anything save size and
cost (eDRAM is far denser than SRAM, but less compatible with the
processes used to implement the rest of the core).

Were the L1's always off-die for PA-RISC?

David
Bill Todd
Guest





Posted: Mon Oct 31, 2005 1:15 am    Post subject: Re: Off-chip cache considerations?

David Kanter wrote:
Quote:
Bill Todd wrote:

Dysthymicdolt@aol.com wrote:

What are the trade-offs that architects consider when deciding
whether to support an off-chip cache in an implementation?

Obviously, off-chip cache allows a capacity which would be
impractical for on-chip cache; however, the latency penalty of
going off-chip is significant, especially when combined with an
on-chip memory controller.

Indeed. POWER4/4+/5 systems have exhibited this characteristic with
their L3 cache for many years, and IIRC PA-RISC systems as well (for
their L1 caches, no less). With POWER4/4+ I think the off-chip cache
support did double duty by eliminating the need for a separate memory
controller, but of course POWER5 has changed that.


The POWER4/4+ indeed had problems with their L3, but I believe that the
cache on the POWER5 is far more effective.

I didn't mention anything about 'problems' with the POWER4/4+ off-chip
cache, and while the POWER5 off-chip L3 has somewhat improved latency it
turns out to be an even larger fraction of the main-memory latency than
was the case in POWER4/4+ (due to the vastly improved main-memory
latency in POWER5).

Quote:


For larger multiple node systems, the

latency of remote accesses tends to increase the benefit of even
an off-chip cache and the reduction in coherence traffic
(especially with snoopy coherence) and memory traffic (especially
multiple hop traffic) can make a large cache more attractive.

That certainly seems to have been the rationale behind the POWER
implementations above, though it's difficult to understand how having a
humongous but slow L1 was a win for PA-RISC in anything save size and
cost (eDRAM is far denser than SRAM, but less compatible with the
processes used to implement the rest of the core).


Were the L1's always off-die for PA-RISC?

The 'IIRC' was there for a reason - perhaps someone else's recollection
is firmer.

- bill
Anton Ertl
Guest





Posted: Mon Oct 31, 2005 9:15 am    Post subject: Re: Off-chip cache considerations?

Bill Todd <billtodd@metrocast.net> writes:
Quote:
David Kanter wrote:
Bill Todd wrote:
it's difficult to understand how having a
humongous but slow L1 was a win for PA-RISC in anything save size and
cost (eDRAM is far denser than SRAM, but less compatible with the
processes used to implement the rest of the core).


Were the L1's always off-die for PA-RISC?

On the PA8500 and later they were on-chip.

IIRC they also had an L0 I-cache or something.

Quote:
The 'IIRC' was there for a reason - perhaps someone else's recollection
is firmer.

IIRC the HP-PA L1 caches were not particularly slow in terms of cycles
(3 cycles? Less than a Prescott), but the cycle times were longish
during some period (also caused by HP not shrinking to 0.35um).

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Bill Todd
Guest





Posted: Mon Oct 31, 2005 5:15 pm    Post subject: Re: Off-chip cache considerations?

Anton Ertl wrote:
Quote:
Bill Todd <billtodd@metrocast.net> writes:

David Kanter wrote:

Bill Todd wrote:

it's difficult to understand how having a
humongous but slow L1 was a win for PA-RISC in anything save size and
cost (eDRAM is far denser than SRAM, but less compatible with the
processes used to implement the rest of the core).


Were the L1's always off-die for PA-RISC?


On the PA8500 and later they were on-chip.

IIRC they also had an L0 I-cache or something.

Ah - that would help explain things.

Quote:


The 'IIRC' was there for a reason - perhaps someone else's recollection
is firmer.


IIRC the HP-PA L1 caches were not particularly slow in terms of cycles
(3 cycles? Less than a Prescott), but the cycle times were longish
during some period (also caused by HP not shrinking to 0.35um).

The off-chip PA caches that I think I recall had something like 40 ns.
latency (and IIRC were eDRAM).

- bill
Guest






Posted: Tue Nov 01, 2005 9:15 am    Post subject: Alpha 21264? L2 cache (was Re: Off-chip cache considerations

Bill Todd wrote:
[snip]
Quote:
(I've never known whether
the off-chip POWER cache is eDRAM, but they don't seem to charge a great
deal for the chips so it may be - and the latency quoted is over twice
what they could potentially have achieved with off-chip SRAM, which at
least on Alpha had sub-20-ns. latencies).

I am guessing that the Alpha's off-chip cache was simply
direct-mapped to achieve such low latency. It seems unfortunate
that such off-chip caches with off-chip tags did not use a
skewing function to reduce conflict misses. (It would even be
possible to use a different function for code by extending
Icache block invalidation instructions to either invalidate L2
code entries or check the address for dirty data in the Dcache
and in L2 with the data indexing function. Alternately, the
skewing function could be set on a per-page basis using perhaps
two bits of the PTEs. One might even have considered page-level
way prediction, perhaps optimized for latency by checking the
tags for all ways in parallel with the predicted way data
access; one might even consider putting a mispredicted dirty
block in the write buffer if the buffer is mostly empty.)

(BTW, the POWER4/4+/5 L3 cache uses eDRAM/1T-SRAM.)


Paul A. Clayton
Anton Ertl
Guest





Posted: Tue Nov 01, 2005 9:15 am    Post subject: Re: Alpha 21264? L2 cache (was Re: Off-chip cache considerat

Dysthymicdolt@aol.com writes:
Quote:

Bill Todd wrote:
[snip]
(I've never known whether
the off-chip POWER cache is eDRAM, but they don't seem to charge a great
deal for the chips so it may be - and the latency quoted is over twice
what they could potentially have achieved with off-chip SRAM, which at
least on Alpha had sub-20-ns. latencies).

The best I have seen was 22ns lmbench latency on an UP1500 with 8MB L2
<2002May11.094029@a0.complang.tuwien.ac.at>. Yes, could be sub-20ns
in the non-back-to-back case; in any case, this was competitive with
some, much smaller, on-chip caches from the same time.

Quote:
I am guessing that the Alpha's off-chip cache was simply
direct-mapped to achieve such low latency.

Off-chip caches typically were, also because more associativity would
have required more pins.

Quote:
It seems unfortunate
that such off-chip caches with off-chip tags did not use a
skewing function to reduce conflict misses.

A skewing function reduces a particular form of conflict misses, but
for an 8MB cache that form should not be a big issue (how many
programs have arrays exactly 8MB apart that they process in lockstep).
A skewing function also reduces the benefit you get from spatial
locality (i.e., fewer conflict misses for spatially local stuff).
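
To make the "arrays exactly 8MB apart" case concrete, here is a minimal Python sketch of plain direct-mapped indexing. The 64-byte block size and the example addresses are assumptions chosen for illustration; the thread states neither.

```python
CACHE_BYTES = 8 * 1024 * 1024   # 8 MB direct-mapped cache, as in the post
BLOCK_BYTES = 64                # assumed block size (not stated in the thread)
NUM_SETS = CACHE_BYTES // BLOCK_BYTES


def direct_mapped_index(addr):
    """Set index for a plain direct-mapped cache."""
    return (addr // BLOCK_BYTES) % NUM_SETS


a = 0x1000_0000                 # hypothetical base of one array
b = a + CACHE_BYTES             # second array exactly 8 MB later
c = a + CACHE_BYTES + 4096      # second array shifted by a further 4 KB

print(direct_mapped_index(a) == direct_mapped_index(b))  # True: same set
print(direct_mapped_index(a) == direct_mapped_index(c))  # False: no conflict
```

Only the exact-multiple spacing collides block for block when the two arrays are walked in lockstep; any other spacing moves the second stream to different sets.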

But if you want skewing, you can achieve that easily by appropriate
virtual-to-physical mapping. If the OS does not do page colouring,
this usually avoids the conflicts that are pathological for a virtual
direct-mapped cache; however, you get others, less predictable ones
instead.
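
The connection between virtual-to-physical mapping and cache conflicts can be sketched with the usual notion of a page's "colour", i.e. which page-sized slice of the cache the page lands on. In the Python sketch below the 8 KB page size and the example page numbers are assumptions, and the helper names are hypothetical.

```python
CACHE_BYTES = 8 * 1024 * 1024   # 8 MB direct-mapped cache
PAGE_BYTES = 8 * 1024           # assumed page size; adjust for the target OS
NUM_COLOURS = CACHE_BYTES // PAGE_BYTES


def page_colour(page_number):
    """Which page-sized slice of the cache a page maps onto."""
    return page_number % NUM_COLOURS


def colours_match(virtual_page, physical_page):
    """True when an OS doing page colouring would accept this mapping."""
    return page_colour(virtual_page) == page_colour(physical_page)


# Two virtual pages exactly one cache size apart share a colour.
vp_a, vp_b = 0, NUM_COLOURS
print(page_colour(vp_a) == page_colour(vp_b))

# Without colouring, the OS may back them with arbitrary physical pages
# (hypothetical numbers), which here land on different colours.
pp_a, pp_b = 5000, 7331
print(page_colour(pp_a), page_colour(pp_b))
```

With colouring, both physical pages would be forced onto the shared colour and the virtual-address conflict would carry over into a physically indexed cache. Without it the conflict usually disappears, at the cost of other collisions that depend on whatever pages happened to be free.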

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Andy Glew
Guest





Posted: Tue Nov 01, 2005 11:49 pm    Post subject: Re: Alpha 21264? L2 cache (was Re: Off-chip cache considerat

Quote:
I am guessing that the Alpha's off-chip cache was simply
direct-mapped to achieve such low latency.

Off-chip caches typically were, also because more associativity would
have required more pins.

More associativity does not always require more pins for external caches.

Conventional associative caches - those described in textbooks such as
H&P - might require more pins. E.g. they might perform N tag matches
in parallel with reading N ways of data out of the data array, and
perform a late select. If the tags and/or way select logic is inside
the processor chip, and the data array is external, this would require
N times more wires.

However, there are at least two configurations where more
associativity does not require more wires.

a) do the N-way tag match and way select externally => only one set of
address wires crosses the chip boundary. But the external cache is
more complicated. IIRC the Motorola 88K CAMMU chips did this.

b) More interesting: use the "tag-index sequential" technique. The N-way
tag match is done, returning an index into the data array. Then the
data array access is done. The data array access is thus sequential
with the tag match, not parallel. If the data array is external, this
means that even fewer pins may be required - just a data array index
number, not a full address - although that would necessitate a
look-aside rather than a look-through cache configuration.
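
The wire-count argument can be put into rough numbers. The Python sketch below counts only address/index and data wires between the processor and an external data array, assuming the tags live on the processor in both cases; control, clock, and ECC pins are ignored, and the example geometry (4 ways, 32768 sets, 128-bit data path) is hypothetical.

```python
def bits_for(count):
    """Number of bits needed to name one of `count` items."""
    return (count - 1).bit_length()


def late_select_pins(ways, sets, data_width_bits):
    """Textbook parallel design: send the set index, read all ways at once."""
    return bits_for(sets) + ways * data_width_bits


def tag_sequential_pins(ways, sets, data_width_bits):
    """Tag match first, then send a (set, way) index and read one block."""
    return bits_for(sets * ways) + data_width_bits


print(late_select_pins(4, 32768, 128))     # 15 index bits + 4 x 128 data bits
print(tag_sequential_pins(4, 32768, 128))  # 17 index bits + 128 data bits
```

Under these assumptions the tag-index sequential arrangement pays two extra index wires for the way number and saves three full data paths.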

Tag-index sequential obviously adds latency compared to tag-index
parallel. But if the tag logic is fast enough - e.g. the tag logic is
done in a fast logic process, while the external cache is done in a
slower process - and if the performance increment due to a larger
cache size (and/or a faster transfer, and/or power saving, and/or
higher associativity) outweighs the cost of the extra latency,
tag-index sequential makes sense.
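
As a minimal sketch of why the latency is serialized, the toy Python model below performs the tag match first and only then touches the data array. The class and method names are hypothetical, and the model ignores replacement, dirty state, and timing.

```python
class TagSequentialCache:
    """Toy N-way cache where the tag match runs before the data access."""

    def __init__(self, num_sets, num_ways, block_bytes=64):
        self.num_sets = num_sets
        self.num_ways = num_ways
        self.block_bytes = block_bytes
        # Tag array: conceptually on the processor or in a fast tag chip.
        self.tags = [[None] * num_ways for _ in range(num_sets)]
        # Data array: conceptually the external RAM, one slot per (set, way).
        self.data = [None] * (num_sets * num_ways)

    def _split(self, addr):
        block = addr // self.block_bytes
        return block % self.num_sets, block // self.num_sets  # (set, tag)

    def fill(self, addr, way, value):
        set_index, tag = self._split(addr)
        self.tags[set_index][way] = tag
        self.data[set_index * self.num_ways + way] = value

    def lookup(self, addr):
        set_index, tag = self._split(addr)
        # Step 1: N-way tag match, producing only a way number.
        for way, stored in enumerate(self.tags[set_index]):
            if stored == tag:
                # Step 2: one data-array access using a plain index.
                return self.data[set_index * self.num_ways + way]
        return None  # miss: the data array is never accessed


cache = TagSequentialCache(num_sets=4, num_ways=2)
cache.fill(0x000, way=0, value="A")
cache.fill(0x100, way=1, value="B")   # same set as 0x000, different tag
print(cache.lookup(0x000), cache.lookup(0x100), cache.lookup(0x200))
```

Step 2 cannot start until step 1 has produced a way number, which is the extra latency; in exchange, a miss never costs a data-array access.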

Tag-index sequential designs are usually worth considering for L2 or
L3 designs, whether internal or external. Indeed, in modern chips with
multiple frequency and power domains, the internal/external issues are
revisited, albeit with different parameters.

Doug Burger got his PhD in part for analyzing tag-sequential designs.
Off the top of my head I cannot think of any external caches that were
tag-sequential. However, I was surprised to find, when doing patent
research, that old set associative cache designs from IBM, etc., were
nearly all tag-index sequential.
Iain McClatchie
Guest





Posted: Wed Nov 02, 2005 1:17 am    Post subject: Re: Alpha 21264? L2 cache (was Re: Off-chip cache considerat

Andy> a) do the N-way tag match and way select externally => only
Andy> one set of address wires crosses the chip boundary. But
Andy> the external cache is more complicated. IIRC the Motorola
Andy> 88K CAMMU chips did this.

The MIPS R8000 consisted of 2 full-custom chips (IU and FPU), a
gate array (cache controller), and three custom tag rams. The
external L2 cache (4MB in 1993) was 4-way set associative. We
sent the upper bits to the tag rams, they did a set match and
drove the set number to the SRAMs as two bits of the address.
IU drove the other two bits.

We had *two* 64b paths through the L2, an even and an odd, and
could do independent accesses through both.

Andy> Off the top of my head I cannot think of any external
Andy> caches that were tag-sequential.

The R10K and successors have a tag-sequential 2-way L2, and a
way predictor on the CPU.
Guest






Posted: Wed Nov 02, 2005 9:15 am    Post subject: Re: Alpha 21264? L2 cache (was Re: Off-chip cache considerat

Anton Ertl wrote:
[snip]
Quote:
A skewing function reduces a particular form of conflict misses, but
for an 8MB cache that form should not be a big issue (how many
programs have arrays exactly 8MB apart that they process in lockstep).
A skewing function also reduces the benefit you get from spatial
locality (i.e., fewer conflict misses for spatially local stuff).

The following skewing function should not hurt too much in the
bad cases:
For an 8MiB direct-mapped cache,
  if addr[23] is set:
    MSb ({addr[9:6], addr[22:10]} XOR addr[24]) addr[5:0] LSb
  if addr[23] is unset:
    MSb (addr[22:6] XOR addr[24]) addr[5:0] LSb

By having half of the memory use a scattered distribution
and having an independent half use inverted ordering, the
common case of forward sequential access should have fewer
bad conflicts, no? On the negative side, a physically
contiguously allocated heap smaller than 8MiB that crosses
the 8MiB boundary would generate more misses than in the
simple direct-mapped cache; also dirty evicted blocks would
more frequently be scattered across multiple DRAM rows
(when a scattered distribution stream is replacing a dirty
sequential stream), perhaps significantly hurting performance.
ISTM that this kind of indexing function could be superior to
straightforward indexing in a 2-way LRU replacement cache.
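
The indexing function above can be written out as a short Python sketch. This is one reading of the notation, not a confirmed design: it assumes 64-byte blocks (so addr[5:0] is the block offset and the index is 17 bits), and it takes "XOR[24]" to mean that every index bit is XORed with addr[24]. The helper names are hypothetical.

```python
INDEX_BITS = 17          # 8 MiB / 64-byte blocks = 2**17 blocks (assumed)
INDEX_MASK = (1 << INDEX_BITS) - 1


def bits(value, high, low):
    """Extract value[high:low], inclusive, as an integer."""
    return (value >> low) & ((1 << (high - low + 1)) - 1)


def skewed_index(addr):
    """Block index under one reading of the function in the post."""
    if bits(addr, 23, 23):
        # Scattered half: addr[9:6] become the top 4 index bits,
        # addr[22:10] the remaining 13.
        index = (bits(addr, 9, 6) << 13) | bits(addr, 22, 10)
    else:
        index = bits(addr, 22, 6)
    if bits(addr, 24, 24):
        index ^= INDEX_MASK   # invert every index bit: reversed ordering
    return index


base = 0x40                       # second block of a hypothetical array
print(skewed_index(base))                 # plain ordering
print(skewed_index((1 << 23) | base))     # scattered ordering
print(skewed_index((1 << 24) | base))     # inverted ordering
```

Under this reading the scattered half is a 4-bit rotation of the plain index, so two forward streams exactly 8 MiB apart coincide only at the all-zeros and all-ones block indices rather than at every block.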

Quote:
But if you want skewing, you can achieve that easily by appropriate
virtual-to-physical mapping. If the OS does not do page colouring,
this usually avoids the conflicts that are pathological for a virtual
direct-mapped cache; however, you get others, less predictable ones
instead.

Unfortunately, that requires special OS support and special
effort on the part of the application developer. Also the
page-level granularity could introduce some whole-page conflicts.

I assume the Alpha architects did the right thing for their
targeted workloads; I am just a bit superstitious about
direct-mapped caches, I guess.


Paul A. Clayton
Guest






Posted: Wed Nov 02, 2005 9:15 am    Post subject: Re: Alpha 21264? L2 cache (was Re: Off-chip cache considerat

Andy Glew wrote:
[snip]
Quote:
More associativity does not always require more pins for external caches.

If one has off-chip tags and on-chip comparator (from a quick
browsing of the 21264 HRM it was not entirely clear if the
comparator was on the processor chip), more associativity
would seem to require more tag bandwidth. Presumably off-chip
tags were chosen to allow highly variable system costs. A
large direct-mapped cache might have had attractive performance
for the target applications (so off-chip way-select [really
hit/miss detect] logic would be an excessive expense). Also
parallel read would still require extra SRAM pins even with
off-chip way-selection; I don't know if that would have been
a significant consideration at that time.

BTW, as an accelerator for tag-index sequential, would it make
sense to use two sets of partial tags to provide a fast,
accurate way prediction (and common case miss detection)? It
seemed that this might allow a pipelining of accesses so that
slightly more than two tag checks could occur per cycle
without double porting or replication (slightly more than two
because most misses would be detected by only accessing one
set of partial tags). The way predicting set of tags might
be placed nearer the data, while the other set might be
placed to reduce snooping latency.
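
One way to picture the two-sets-of-partial-tags idea is the Python sketch below. It is a guess at the scheme, not a worked-out design: each full tag is split into a low part and a high part held in separate arrays, the low array gives the way prediction or an early miss, and the high array confirms it. The 8-bit split and all names are assumptions.

```python
LOW_BITS = 8   # assumed split point; the post gives no widths


def split_tag(tag):
    """Split a full tag into (low partial tag, high partial tag)."""
    return tag & ((1 << LOW_BITS) - 1), tag >> LOW_BITS


def lookup(low_tags, high_tags, tag):
    """Return (way or None, number of partial-tag arrays read)."""
    low, high = split_tag(tag)
    candidates = [w for w, t in enumerate(low_tags) if t == low]
    if not candidates:
        return None, 1          # miss detected from the first array alone
    for way in candidates:
        if high_tags[way] == high:
            return way, 2       # prediction confirmed by the second array
    return None, 2              # partial match that was not a real hit


# One hypothetical 4-way set holding these full tags.
full_tags = [0x1234, 0x5678, 0x9A34, 0xFFFF]
low_tags = [split_tag(t)[0] for t in full_tags]
high_tags = [split_tag(t)[1] for t in full_tags]

print(lookup(low_tags, high_tags, 0x9A34))  # hit after reading both arrays
print(lookup(low_tags, high_tags, 0x0011))  # miss after reading one array
```

Most misses stop after the first array, which is where the "slightly more than two tag checks per cycle" estimate would come from if the two arrays were pipelined.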


Paul A. Clayton