Non Power of 2 Cache Sizes
CASTalk.com Forum Index CASTalk.com
Discussion of DSP, FPGA, storage and embedded system.
 
 FAQFAQ   MemberlistMemberlist     RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 
 
Google
 
Web castalk.com
Non Power of 2 Cache Sizes
Goto page Previous  1, 2, 3  Next
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture
Author Message
Terje Mathisen
Guest





Posted: Wed Aug 17, 2005 12:15 am    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

Nick Maclaren wrote:

Quote:
In article <ddsgb9$4em$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> writes:
|> If you are going to do something like this (i.e. having a prime number
|> of banks), then it is probably a good idea to select a prime of the form
|> 2^n-1 or 2^n+1, since you can do div/mod operations on these a lot
|> faster than on a random modulus.

Nope. That is true only if you use the current, ghastly, designs.
Such a cache structure is undesirable on performance grounds, even
with a power of two, because it is really nasty when your array
sizes match the cache size.

I agree. Prime bases are better though.
Quote:

What you should do is to hash the address - as this need not be a
cryptographic hash, doesn't have to be perfect, and doesn't have
to be done in software, it can be fast. Such a hash can produce
values in any range.

But that is what we're currently doing!

All caches use (extremely simple) hash functions. Doing mod(prime)
instead of mod(2^n) just reduces the number of bad interactions. :-)

Terje

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
John Savard
Guest





Posted: Wed Aug 17, 2005 6:20 am    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

On 16 Aug 2005 00:01:01 -0700, "Vikas Mishra" <vikas.mishra@gmail.com>
wrote, in part:

Quote:
But is there any other real reason besides this why "non power of 2
caches" can't be used ?

The obvious reason is that you are partly wasting address decoding
circuitry for the cache.

But that hasn't stopped Intel from having 3 MB caches for some of its
Itanium chips. Maybe it's really a 4 MB cache of which part is not used,
so as to get the yield tolerable for such huge dies...

John Savard
http://www.quadibloc.com/index.html
_________________________________________
Usenet Zone Free Binaries Usenet Server
More than 120,000 groups
Unlimited download
http://www.usenetzone.com to open account
Back to top
Joe Pfeiffer
Guest





Posted: Wed Aug 17, 2005 8:15 am    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

"Vikas Mishra" <vikas.mishra@gmail.com> writes:

Quote:
Hi Folks,

We were having a discussion today about having a "non power of 2
instruction cache". Conceptually I would think that the only "hard"
reason for having a power of 2 is that the line index calculation is
trivial and for a non power of 2 it would be some complicated
calculation like modulus with 40K (for a size of say 40KB) or something
equally strange.

But once you've hit this, the question becomes one of what compelling
reason do you have to *not* have a power of 2?
--
Joseph J. Pfeiffer, Jr., Ph.D. Phone -- (505) 646-1605
Department of Computer Science FAX -- (505) 646-1002
New Mexico State University http://www.cs.nmsu.edu/~pfeiffer
skype: jjpfeifferjr
Back to top
John Mashey
Guest





Posted: Wed Aug 17, 2005 8:15 am    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

John Savard wrote:
Quote:
On 16 Aug 2005 00:01:01 -0700, "Vikas Mishra" <vikas.mishra@gmail.com
wrote, in part:

But is there any other real reason besides this why "non power of 2
caches" can't be used ?

The obvious reason is that you are partly wasting address decoding
circuitry for the cache.

But that hasn't stopped Intel from having 3 MB caches for some of its
Itanium chips. Maybe it's really a 4 MB cache of which part is not used,
so as to get the yield tolerable for such huge dies...

Sigh.

1) *No one* builds a die with 4MB of cache and only uses 3MB. Any
designer who proposed that would be fired. There are different ways to
do this, of which some have been used since the mid-1990s or earlier in
microprocessors, but they involve adding a just a modest amount of
extra silicon to the cache to improve yield.

2) I lose track of the various versions, but in Itanium 2-land, L3
caches:
3MB: 12-way
6MB: 24-way
9MB: 18-way

The L1 caches were 16KB (4-way) and L2 was 256KB (8-way).

Quote:
From 2003 Hot Chips presentation on Itanium 2:
a) The L3 is a big chunk of the chip, of quite irregular shape, and it

is organized as 140 "tiles", of which:
128 are for data
8 for ECC
4 for yield redundancy

For the gory details see:
Rusau, Muljone, Cherkauer, "Itanium 2 Processor 6M: Higher Frequency
and Larger L3 Cache," IEEE Micro Vol 24, No 2 (March/April 2004), 10-19
Back to top
Nick Maclaren
Guest





Posted: Wed Aug 17, 2005 1:29 pm    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

In article <ddtk9v$qqh$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
Quote:

What you should do is to hash the address - as this need not be a
cryptographic hash, doesn't have to be perfect, and doesn't have
to be done in software, it can be fast. Such a hash can produce
values in any range.

But that is what we're currently doing!

All caches use (extremely simple) hash functions. Doing mod(prime)
instead of mod(2^n) just reduces the number of bad interactions. :-)

Er, yes. Sorry about being massively unclear. I made a hash of using
the term hash :-(

What I meant was using a slightly scrambled hash - e.g. consider the
following to reduce 64 bits to 16:

( N ^ (N>>1) ^ (N>>4) ^ (N>>11) ^ (N>>26)) ^ (N>>57) ) & 0xffff

This is dead easy to calculate fast in hardware, and would remove
many of the "gotchas". It was just picked up off the top of my head,
and there are doubtless better choices.

I don't know how easy it is to take modulo N for a fixed N is in
hardware - the fact that doing it for a variable N is a pain proves
little. As I have posted before, x%N for fixed N can be done a LOT
faster for fixed but arbitrary N (i.e. ignoring setup costs) than
for variable N in software, but still not fast enough for this use.


Regards,
Nick Maclaren.
Back to top
Nick Maclaren
Guest





Posted: Wed Aug 17, 2005 4:15 pm    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

In article <1124293178.471524.312690@g47g2000cwa.googlegroups.com>,
"David Kanter" <dkanter@gmail.com> writes:
|>
|> What exactly is the problem with having your array size match the cache
|> size? Sorry, ISTR something like this mentioned in H&P:aQa, I just
|> don't recall the exact problem.

With a fully-associative cache, there is little problem. With
one with a bounded associativity, there is a major problem when
the size of one dimension of a matrix is a multiple of that of
the cache size. Try writing the naive matrix multiply on square
matrices and see what happens whan that is so.

Essentially, every time round the inner loop (N^3 of them) means
several cache misses, including a write of a dirty line.


Regards,
Nick Maclaren.
Back to top
David Kanter
Guest





Posted: Wed Aug 17, 2005 4:15 pm    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

Nick Maclaren wrote:
Quote:
In article <ddv7hu$is1$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> writes:
|
|> > I don't know how easy it is to take modulo N for a fixed N is in
|> > hardware - the fact that doing it for a variable N is a pain proves
|> > little. As I have posted before, x%N for fixed N can be done a LOT
|> > faster for fixed but arbitrary N (i.e. ignoring setup costs) than
|> > for variable N in software, but still not fast enough for this use.
|
|> Fixed N, where N is on the form (2^n) +/- 1 can be done more or less (*)
|> in the same time as your example above, the code is similar.

Oh, yes, but the problem with numbers like that is that they aren't
rare as array sizes in programs. They may not be AS common as
powers of two, but 2^n+1 in particular is definitely common. It
falls naturally out of certain FFT designs. 2^n-1 is rarer, but
not rare.

The advantage of a non-modulo approach is that it doesn't have
the catastrophic failure mode when a critical array size matches
the cache size. All modulo approaches do, but array sizes tend
to follow certain patterns.

What exactly is the problem with having your array size match the cache
size? Sorry, ISTR something like this mentioned in H&P:aQa, I just
don't recall the exact problem.

David
Back to top
Anne & Lynn Wheeler
Guest





Posted: Wed Aug 17, 2005 4:15 pm    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

"John Mashey" <old_systems_guy@yahoo.com> writes:
Quote:
Sigh.

1) *No one* builds a die with 4MB of cache and only uses 3MB. Any
designer who proposed that would be fired. There are different ways to
do this, of which some have been used since the mid-1990s or earlier in
microprocessors, but they involve adding a just a modest amount of
extra silicon to the cache to improve yield.

walk down memory lane ...

big change from 370/168-1 to 370/168-3 was to double the cache size
from 32kbytes to 64kbytes ... however they used the 2kbit to index the
additional entries.

the problem was that worked when the software was running 4k virtual
pages ... but if the software was running with 2k virtual pages, it
would only use 32kbytes.

also, if you had software that switched betweek 2k virtual pages and
4k virtual pages ... the cache was flushed at the switch.

the dos/vs and vs1 operating systems ran 2k virtual pages, mvs ran 4k
virtual pages. vm for "real" address virtual machines ran 4k virtual
pages by default ... but if they were running guest operating systems
that used virtual memory ... the shadow pages were whatever the guest
was using.

there was some large customer running vs1 production work under vm on
168-1 and installed the 168-3 upgrade (32k->64k cache) expecting
performance to get better ... but things got much worse. first, since
vs1 was running 2k virtual pages ... all of the vm shadow tables for
vs1 were 2k virtual pages ... which met the cache only worked with
32kbytes. the other was that vm supervisor reloaded 4k virtual page
mode by default ... so there was lots of cache flush overhead
constantly switching back and forth betweek 2k and 4k virtual page
moades.

the 370 was purely 24bit/16mbyte virtual address spaces. the 168s used
the 8mbyte bit in TLB entry selection. MVS (& svs) had 16mbyte space
mapped out so that there was an 8mbyte MVS kernel image in each of its
virtual address spaces. That left only 8mbytes (in each virtual
address space) for running application (although as system functions
grew ... there were some environments where only 3mbytes in each
virtual address space left for application execution).

the 168 had a 128 entry TLB with seven entry "STO" stack (segment
table origin, basically uniquely identified each virtual address
sapce).

For basic MVS, half the TLB entries went to the kernel image pages and
half the TLB entries went to application pages.

however, a lot of virtual address space evivronments running under VM
tended to start virtual address space use at zero and grow upwards, a
large number of applications rarely ever exceeded a couple mbytes ...
so frequently at least half the TLB entries (for virtual address
Quote:
8mbytes) went unused.

--
Anne & Lynn Wheeler | http://www.garlic.com/~lynn/
Back to top
Terje Mathisen
Guest





Posted: Wed Aug 17, 2005 4:15 pm    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

Nick Maclaren wrote:
Quote:
What I meant was using a slightly scrambled hash - e.g. consider the
following to reduce 64 bits to 16:

( N ^ (N>>1) ^ (N>>4) ^ (N>>11) ^ (N>>26)) ^ (N>>57) ) & 0xffff

This is dead easy to calculate fast in hardware, and would remove
many of the "gotchas". It was just picked up off the top of my head,
and there are doubtless better choices.

I don't know how easy it is to take modulo N for a fixed N is in
hardware - the fact that doing it for a variable N is a pain proves
little. As I have posted before, x%N for fixed N can be done a LOT
faster for fixed but arbitrary N (i.e. ignoring setup costs) than
for variable N in software, but still not fast enough for this use.

Fixed N, where N is on the form (2^n) +/- 1 can be done more or less (*)
in the same time as your example above, the code is similar.

Terje

(*) A little more, because you introduce a small adder in the path.

--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching"
Back to top
Nick Maclaren
Guest





Posted: Wed Aug 17, 2005 4:15 pm    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

In article <ddv7hu$is1$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> writes:
|> >
|> > I don't know how easy it is to take modulo N for a fixed N is in
|> > hardware - the fact that doing it for a variable N is a pain proves
|> > little. As I have posted before, x%N for fixed N can be done a LOT
|> > faster for fixed but arbitrary N (i.e. ignoring setup costs) than
|> > for variable N in software, but still not fast enough for this use.
|>
|> Fixed N, where N is on the form (2^n) +/- 1 can be done more or less (*)
|> in the same time as your example above, the code is similar.

Oh, yes, but the problem with numbers like that is that they aren't
rare as array sizes in programs. They may not be AS common as
powers of two, but 2^n+1 in particular is definitely common. It
falls naturally out of certain FFT designs. 2^n-1 is rarer, but
not rare.

The advantage of a non-modulo approach is that it doesn't have
the catastrophic failure mode when a critical array size matches
the cache size. All modulo approaches do, but array sizes tend
to follow certain patterns.


Regards,
Nick Maclaren.
Back to top
Seongbae Park
Guest





Posted: Wed Aug 17, 2005 5:08 pm    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

Vikas Mishra <vikas.mishra@gmail.com> wrote:
....
Quote:
In case there have been real/experimental designs with non power of 2
caches, could you point me to references.

One of the most recent examples: Sun's Niagara with 12-way 4-banked 3MB L2:

Niagara: A 32-Way Multithreaded Sparc Processor
Kongetira, P.; Aingaran, K.; Olukotun, K.;
Micro, IEEE
Volume 25, Issue 2, March-April 2005 Page(s):21 - 29

There are good reasons for this seemingly odd-choice.
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"
Back to top
Sander Vesik
Guest





Posted: Wed Aug 17, 2005 5:18 pm    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:
Quote:

All caches use (extremely simple) hash functions. Doing mod(prime)
instead of mod(2^n) just reduces the number of bad interactions. :-)

using mod for the hash is probably fundamentaly wrong way to
go about it though.

Quote:

Terje


--
Sander

+++ Out of cheese error +++
Back to top
Guest






Posted: Wed Aug 17, 2005 10:40 pm    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

Quote:
Hashing is becoming a lost art,
just like everything else of
importance in computing ;-)

That's not quite a haiku. Too many syllables.
Back to top
Anton Ertl
Guest





Posted: Wed Aug 17, 2005 10:45 pm    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

Sander Vesik <sander@haldjas.folklore.ee> writes:
Quote:
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:

All caches use (extremely simple) hash functions. Doing mod(prime)
instead of mod(2^n) just reduces the number of bad interactions. :-)

using mod for the hash is probably fundamentaly wrong way to
go about it though.

Why?

Apart from being simple to implement, the mod approach has the
following benefits:

- Its performance effects (in particular, the worst-case performance)
are relatively easy to understand, making it easy to avoid them.

- Spatial locality performs well with it. If I have, e.g., a 4KB way
size, then I know that I can access a 4KB array without incurring any
conflict misses. If, OTOH, i had a random-style hash function, even
two cache lines that are adjacent in virtual address space could
conflict with each other. BTW, that's one of the benefits of using
page colouring over arbitrary page mapping: It performs better in the
presence of spatial locality. Of course, with enough associativity in
the cache, it does not matter.

- anton
--
M. Anton Ertl Some things have to be seen to be believed
anton@mips.complang.tuwien.ac.at Most things have to be believed to be seen
http://www.complang.tuwien.ac.at/anton/home.html
Back to top
Nick Maclaren
Guest





Posted: Thu Aug 18, 2005 12:07 am    Post subject: Re: Non Power of 2 Cache Sizes Reply with quote

In article <2005Aug17.194501@mips.complang.tuwien.ac.at>,
Anton Ertl <anton@mips.complang.tuwien.ac.at> wrote:
Quote:
Sander Vesik <sander@haldjas.folklore.ee> writes:
Terje Mathisen <terje.mathisen@hda.hydro.com> wrote:

All caches use (extremely simple) hash functions. Doing mod(prime)
instead of mod(2^n) just reduces the number of bad interactions. :-)

using mod for the hash is probably fundamentaly wrong way to
go about it though.

Why?

Apart from being simple to implement, the mod approach has the
following benefits:

- Its performance effects (in particular, the worst-case performance)
are relatively easy to understand, making it easy to avoid them.

Oh, really? Do tell.

As with strength reduction and integer multiplication, 90% of the
cases are easy to feasible. The remaining 10% are difficult to
impossible (and, yes, I have seen the latter). Perhaps the most
common example of a damn-near impossible problem with cache sizes
is a 1-D FFT of arbitrary and uncontrollable size, but there are
others.

Quote:
- Spatial locality performs well with it. If I have, e.g., a 4KB way
size, then I know that I can access a 4KB array without incurring any
conflict misses. ...

Provided that your program uses no other data, not even workspace
generated by the compiler. Please come back slightly closer to
reality.



Regards,
Nick Maclaren.
Back to top
 
Post new topic   Reply to topic    CASTalk.com Forum Index -> Computer Architecture All times are GMT
Goto page Previous  1, 2, 3  Next
Page 2 of 3

 
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum




VoIP Electronics Powered by phpBB