karl malbrain
Guest
Posted: Wed Dec 08, 2004 7:19 pm
Post subject: x86 level one cache associativity
I have a 4K table and a 64-byte working data area that need to fit
simultaneously into the level one processor cache. I would prefer not to
force either of these into any specific location relative to one another,
i.e. they may fall anywhere in memory.
Does 2-way set associativity guarantee that these two areas can
co-exist in level one cache peacefully (to the extent that a context switch
or hyper-threading thrashing doesn't occur)?
Thanks, karl m
Eric
Guest
Posted: Fri Dec 10, 2004 6:52 pm
Post subject: Re: x86 level one cache associativity
karl malbrain wrote:
> I have a 4K table and a 64-byte working data area that need to fit
> simultaneously into the level one processor cache. I would prefer not to
> force either of these into any specific location relative to one another,
> i.e. they may fall anywhere in memory.
> Does 2-way set associativity guarantee that these two areas can
> co-exist in level one cache peacefully (to the extent that a context
> switch or hyper-threading thrashing doesn't occur)?
> Thanks, karl m

No. Cache lines are 64 bytes long, and there is storage for only so many lines
in the cache. You don't care where in the cache they are and it doesn't
matter; lines (in a particular cache) are stored based on the address.
Take a data cache:
If process A touches a large amount of data in memory marked as cacheable,
then at some point lines will start being evicted from the cache based on
an LRU algorithm. If process B now gets control, some of A's cache lines are
gonna have to be evicted to make room for process B's requirements.
I'm doing this entirely from memory from here on, so there's no
guarantee of precise accuracy:
Let's take a WB (write-back) 2-way set-associative cache with 4 sets. Each cache
line is 64 bytes (normal for current IA-32 CPUs), so the cache size is
2 ways * 4 sets * 64 bytes = 512 bytes, plus some space for the tag fields, LRU bits etc.
Each line has a tag field, and each way holds up to 4 cache lines (one per set).
Cache lines start at addresses 0, 0x40, 0x80, 0xc0, 0x100 etc., so address
bits 0-5 are just the offset within the line.
Bits 6-7 of the address select the set (0 to 3) and bits 8-31 are the tag field.
Note that the address picks the set, not the way: within a set the line goes
into whichever way is free, or replaces the least recently used one.
Cacheable memory is fetched and written in full cache lines.
So suppose you read memory at 0x100 and the cache is empty:
0x100 = 1 00 000000 binary -> tag 1, set 0, offset 0
The line goes into set 0, way 0, and way 0 becomes the most recently used:
set 0:  way 0: tag 1 <--64 bytes of data-->   way 1: (empty)
Now suppose you read memory from, say, 0x140:
0x140 = 1 01 000000 binary -> tag 1, set 1, offset 0
The tag bits don't change but the set bits do, so it goes into set 1 and
never competes with the line from 0x100.
Lines only compete when their set bits match. Read 0x200 (tag 2, set 0) and it
fills way 1 of set 0, which makes way 0 (the line from 0x100) the least
recently used. A third line mapping to set 0, say 0x300, would then evict it.
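A minimal sketch of how an address splits into tag, set index and line offset. It assumes the conventional set-associative layout (offset in the low bits, set index next, tag above) and a toy geometry of 4 sets with 64-byte lines. The name `split_address` and the constants are illustrative, not taken from any CPU manual.

```python
LINE_SIZE = 64   # bytes per cache line (assumed)
NUM_SETS = 4     # toy value: a 2-way cache of 512 bytes

def split_address(addr, line_size=LINE_SIZE, num_sets=NUM_SETS):
    """Return (tag, set_index, offset) for a byte address."""
    offset = addr % line_size          # position inside the line
    line_number = addr // line_size    # which 64-byte line of memory
    set_index = line_number % num_sets # the set the line must go into
    tag = line_number // num_sets      # what is stored to identify the line
    return tag, set_index, offset

for a in (0x100, 0x140, 0x200):
    tag, set_index, offset = split_address(a)
    print(f"{a:#x}: tag={tag} set={set_index} offset={offset}")
```

In this model 0x100 and 0x200 land in the same set with different tags, so they are the pair that needs two ways; 0x140 lands in a different set altogether.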
So, expand this out to an 8-way set-associative cache and take a cache size
of 512K. With 64-byte lines you can effectively cache up to
512K / 64 = 8192 lines (or 512K / 128 = 4096 lines if the cache uses 128-byte lines).
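The line count follows directly from the cache size and the line size, and the number of sets from the line count and the associativity. A small sketch of that arithmetic, with a hypothetical helper name and both a 64-byte and a 128-byte line size tried as assumptions:

```python
def cache_geometry(size_bytes, ways, line_size):
    """Return (total_lines, sets) for a set-associative cache."""
    total_lines = size_bytes // line_size
    sets = total_lines // ways
    return total_lines, sets

print(cache_geometry(512 * 1024, 8, 64))    # (8192, 1024)
print(cache_geometry(512 * 1024, 8, 128))   # (4096, 512)
```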
Now back to your question (assume a moderate-sized data cache): if Process A
only touches, say, 1 or 2K of closely bunched data (caching 32 lines or so),
then that 2K maps into an area of the cache, XX.
Then if Process B only touches a small amount of data and it maps into the
cache at an area "not in XX", then you won't get cache thrashing, IF there
isn't a bunch of interrupts, IF there isn't a cache flush, etc.
But 512K is a level 2 cache. Level 1 cache is much, much smaller, so it is very
likely it will thrash to some degree, and there isn't much you can do about
it unless you can control the whole system: OS, apps, interrupts, etc.
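The A-then-B-then-A scenario can be played out on a toy model. The sketch below is an assumption-laden simulator, not a description of any real CPU: it uses true LRU per set (real L1 caches often use a pseudo-LRU), a hypothetical 8K 4-way geometry with 64-byte lines, and made-up base addresses. It ignores interrupts, prefetching and hyper-threading entirely.

```python
from collections import OrderedDict

class SetAssocCache:
    """Toy set-associative cache with true LRU replacement per set."""
    def __init__(self, size_bytes, ways, line_size=64):
        self.line_size = line_size
        self.ways = ways
        self.num_sets = size_bytes // (line_size * ways)
        self.sets = [OrderedDict() for _ in range(self.num_sets)]
        self.misses = 0

    def access(self, addr):
        line = addr // self.line_size
        lines_in_set = self.sets[line % self.num_sets]
        tag = line // self.num_sets
        if tag in lines_in_set:
            lines_in_set.move_to_end(tag)       # now most recently used
            return True
        self.misses += 1
        if len(lines_in_set) >= self.ways:
            lines_in_set.popitem(last=False)    # evict least recently used
        lines_in_set[tag] = True
        return False

def touch(cache, base, size):
    """Read every cache line of a contiguous region once."""
    for addr in range(base, base + size, cache.line_size):
        cache.access(addr)

def misses_on_return(b_base, b_size):
    """Misses process A takes on its 2K of data after process B has run."""
    cache = SetAssocCache(8 * 1024, 4)   # hypothetical L1 geometry
    touch(cache, 0x10000, 2048)          # A warms the cache
    touch(cache, b_base, b_size)         # B runs
    before = cache.misses
    touch(cache, 0x10000, 2048)          # A resumes
    return cache.misses - before

print(misses_on_return(0x20000, 256))        # small B: 0
print(misses_on_return(0x20000, 16 * 1024))  # large B: 32
```

With a small working set in B, A finds all 32 of its lines still cached; once B streams through more data than the cache holds, every one of A's lines has been evicted.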
Remember that the P4 uses a small trace cache for L1 code, and it caches
decoded instructions (uops: not the actual bytes you see composing the
instructions in memory, but micro-ops it creates from decoding those bytes
into RISC-like opcodes).
Final result: keep your code and data small and tight, and that's the best you
can do.
Hope this helps some,
Eric
Robert Redelmeier
Guest
Posted: Fri Dec 10, 2004 11:28 pm
Post subject: Re: x86 level one cache associativity
In comp.lang.asm.x86 karl malbrain <spamtrap@crayne.org> wrote:
> I have a 4K table and a 64-byte working data area that need
> to fit simultaneously into the level one processor cache.
> I would prefer not to force either of these into any
> specific location relative to one another, i.e. they may
> fall anywhere in memory.
> Does 2-way set associativity guarantee that these two
> areas can co-exist in level one cache peacefully (to the
> extent that a context switch or hyper-threading thrashing
> doesn't occur)?

Yes, presuming you're talking about a 4 KB table and modern
CPUs with L1 > 4 KB.
AFAIK, the associativity runs linearly, not folded. The 4 KB
will get loaded as a block (or at alternate addresses based on
LRU) and the 64 bytes will get loaded as one or two cachelines,
potentially at alternate addresses.
It is more important that the accesses be aligned.
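Why alignment matters can be seen by counting how many cache lines a region touches for a given start address. A minimal sketch, assuming 64-byte lines; the function name and the example addresses are made up for illustration:

```python
def lines_spanned(addr, size, line_size=64):
    """Number of cache lines touched by `size` bytes starting at `addr`."""
    first = addr // line_size
    last = (addr + size - 1) // line_size
    return last - first + 1

print(lines_spanned(0x1000, 64))     # line-aligned 64 bytes: 1
print(lines_spanned(0x1020, 64))     # misaligned 64 bytes: 2
print(lines_spanned(0x1000, 4096))   # line-aligned 4K table: 64
print(lines_spanned(0x1020, 4096))   # misaligned 4K table: 65
```

The misaligned cases each cost one extra line, which is where the "one or two cachelines" for the 64-byte area comes from.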
-- Robert
Guest
Posted: Fri Dec 17, 2004 11:36 pm
Post subject: Re: x86 level one cache associativity
Actually the point about "wayness" is important here. The OP's
situation is 4K of data, and 2 data structures. It works on the
Pentiums because they have 8K (>=4K) worth of primary L1 addressable
cache entries, and 4-way (>= 2) set associativity. It works on Athlons
because they have 32K (>=4K) worth of addressable cache entries, and
2-way (>=2) set associativity. It works on K6's because they have 16K
(>=4K) of addressable cache entries, and 2-way (>=2) set associativity.
I.e., since the OP wants both size *and* wayness, he's basically got to
divide the L1 cache size by the wayness itself when considering
size, and then compare the number of independent data structures to the
number of ways of associativity that the cache supports.
But if the OP had a 16K data structure, for example, it still works,
even for the Pentium, because it would be forced to use up 2 ways to
hold it in the L1, but would still have 2 ways left for the remaining
data.
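One way to turn this rule of thumb into a conservative check is sketched below. It is a sufficient test, not an exact one, and it rests on several assumptions: each structure is contiguous in the address space the cache indexes by, replacement is true LRU, and nothing else touches the cache. The function names and the geometries passed in are hypothetical examples, not claims about particular CPUs.

```python
def worst_case_lines(size, line_size, aligned):
    """Most cache lines a contiguous block of `size` bytes can span."""
    if aligned:                                   # starts on a line boundary
        return (size + line_size - 1) // line_size
    return (size + line_size - 2) // line_size + 1   # arbitrary start byte

def guaranteed_to_coexist(sizes, cache_size, ways, line_size=64, aligned=False):
    """True if the blocks can never need more than `ways` lines in one set."""
    sets = cache_size // (ways * line_size)
    per_set = 0
    for size in sizes:
        n = worst_case_lines(size, line_size, aligned)
        # n consecutive lines put at most ceil(n / sets) lines in any one set
        per_set += (n + sets - 1) // sets
    return per_set <= ways

# 4K table plus 64-byte work area, placed anywhere in memory
print(guaranteed_to_coexist([4096, 64], 8 * 1024, 4))                  # True
print(guaranteed_to_coexist([4096, 64], 64 * 1024, 2))                 # True
print(guaranteed_to_coexist([4096, 64], 8 * 1024, 2, line_size=32))    # False
print(guaranteed_to_coexist([4096, 64], 8 * 1024, 2, line_size=32,
                            aligned=True))                             # True
```

The third case shows where the guarantee can fail: when one way is exactly as large as the table, a misaligned table spills one extra line into a set it already occupies, and the work area may need that same set.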
--
Paul Hsieh
http://www.pobox.com/~qed/
http://bstring.sf.net/
karl malbrain
Guest
Posted: Sat Dec 18, 2004 1:00 am
Post subject: Re: x86 level one cache associativity
<spamtrap@crayne.org> wrote in message
news:1103325428.613126.303800@f14g2000cwb.googlegroups.com...
> Actually the point about "wayness" is important here. The OP's
> situation is 4K of data, and 2 data structures. It works on the
> Pentiums because they have 8K (>=4K) worth of primary L1 addressable
> cache entries, and 4-way (>= 2) set associativity. It works on Athlons
> because they have 32K (>=4K) worth of addressable cache entries, and
> 2-way (>=2) set associativity. It works on K6's because they have 16K
> (>=4K) of addressable cache entries, and 2-way (>=2) set associativity.
> I.e., since the OP wants both size *and* wayness, he's basically got to
> divide the L1 cache size by the wayness itself when considering
> size, and then compare the number of independent data structures to the
> number of ways of associativity that the cache supports.
> But if the OP had a 16K data structure, for example, it still works,
> even for the Pentium, because it would be forced to use up 2 ways to
> hold it in the L1, but would still have 2 ways left for the remaining
> data.

Thanks, this is exactly what I'm looking for. karl m