| Author | Message |
Guest | Posted: Wed Nov 09, 2005 11:46 pm | Post subject: TLB implementation: CAM vs SRAM
i was curious as to why TLBs are implemented using CAMs.
specifically, it seems to me that the organization used in caches
(e.g., 4-way set associative) would work as well for a TLB as for a
D-cache - you can make much larger SRAMs than CAMs.
i'm guessing it has something to do with the fact that the page table
already gets the benefit of the cache, but it's not obvious. (maybe a
small CAM is faster than the cache?)
cheers,
adnan
ps - i teach a CMOS vlsi design class at UT Austin, and we're covering
RAMs, which led to this question |
|
Iain McClatchie, Guest | Posted: Thu Nov 10, 2005 1:01 am | Post subject: Re: TLB implementation: CAM vs SRAM
Adnan> specifically, it seems to me that the organization used in caches
Adnan> (e.g., 4-way set associative) would work as well for a TLB as for a
Adnan> D-cache - you can make much larger SRAMs than CAMs.
R8000 had a 3-way set associative TLB with 384 entries. (It was 4-way
until we ran out of die space.)
Adnan> i'm guessing it has something to do with the fact that the page table
Adnan> already gets the benefit of the cache, but it's not obvious.
I don't really know the answer, but I do know that
- you need at least as much associativity in the TLB as in the cache,
usually more.
- you have a lot fewer entries in the TLB. That makes the number of
sets very small, which makes short bitlines, which gets the SRAM
off the sweet spot.
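A minimal sketch of the arithmetic behind the second point: the number of sets (rows per way) is just entries divided by ways. The R8000 figures are the ones quoted above; the 64-entry, 4-way TLB is a hypothetical contrast case, and `num_sets` is an illustrative helper name.

```python
def num_sets(entries: int, ways: int) -> int:
    """Number of sets (rows per way) in a set-associative array."""
    if entries % ways != 0:
        raise ValueError("entries must be a multiple of ways")
    return entries // ways

# R8000 figures quoted in the post: 384 entries, 3-way.
r8000_sets = num_sets(384, 3)

# Hypothetical small TLB for contrast (not from the post).
small_tlb_sets = num_sets(64, 4)

print(r8000_sets, small_tlb_sets)  # 128 16
```

With only 16 rows per way, each bitline is very short, which is the regime the post describes as off the SRAM sweet spot.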
Adnan> (maybe a small CAM is faster than the cache?)
Adnan> i teach a CMOS vlsi design class at UT Austin, and we're covering
Adnan> RAMs, which led to this question
Well prof, is a CAM faster than an SRAM, comparators, and mux? Seems
like a good class project. How much smaller does it have to be to be
faster than a 2-way SRAM design? 4-way? 8-way? (Note that CAMs
aren't fussy about powers of 2, so you can get a very precise answer
here.)
If you're nice, give the students a fixed CAM and SRAM cell. If you're
really nice, tell me what you find out!
If you're mean, give them an address trace and dinero and let them figure
out, at the breakeven speed point, which of the CAM or SRAM has better
hit rates for 4KB blocks. |
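A rough sketch of the kind of hit-rate experiment suggested above. This is a stand-in written from scratch, not dinero: it assumes LRU replacement, 4 KB pages, and a tiny synthetic trace chosen to provoke conflict misses. `hit_rate` and its parameters are illustrative names; a fully associative CAM is modelled as a single set.

```python
from collections import OrderedDict

PAGE_SHIFT = 12  # 4 KB pages, matching the "4KB blocks" in the post

def hit_rate(trace, num_sets, ways):
    """LRU TLB hit rate over a list of virtual addresses.

    num_sets=1 models a fully associative (CAM-style) TLB.
    """
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    for addr in trace:
        vpn = addr >> PAGE_SHIFT
        entries = sets[vpn % num_sets]
        if vpn in entries:
            hits += 1
            entries.move_to_end(vpn)         # mark as most recently used
        else:
            if len(entries) >= ways:
                entries.popitem(last=False)  # evict least recently used
            entries[vpn] = True
    return hits / len(trace) if trace else 0.0

# Synthetic trace (an assumption, not real data): five pages whose
# page numbers are multiples of 4, touched twice in the same order.
trace = [page << PAGE_SHIFT for page in (0, 4, 8, 12, 16)] * 2

cam_rate = hit_rate(trace, num_sets=1, ways=8)   # 8 entries, fully associative
sram_rate = hit_rate(trace, num_sets=4, ways=2)  # 8 entries, 4 sets x 2 ways
print(cam_rate, sram_rate)  # 0.5 0.0
```

The trace is deliberately pathological: all five pages fall into one set of the 2-way design, so it thrashes while the same-capacity fully associative design hits on the second pass. A real address trace would give a far less extreme gap.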
|
Dale Morris, Guest | Posted: Thu Nov 10, 2005 1:15 am | Post subject: Re: TLB implementation: CAM vs SRAM
<adnan.aziz@gmail.com> wrote in message
news:1131558407.194222.326600@g47g2000cwa.googlegroups.com...
| Quote: | i was curious as to why TLBs are implemented using CAMs.
specifically, it seems to me that the organization used in caches
(e.g., 4-way set associative) would work as well for a TLB as for a
D-cache - you can make much larger SRAMs than CAMs
|
One reason is variable page sizes (for architectures that support them).
Caches typically have a fixed line size, and so the division between index
and tag for a set-associative cache can be done independent of the data
being accessed. But, for TLBs that can hold different page sizes, no static
division of address bits between index and tag gives you the optimal design
(where a mapping for a large page appears just once in the TLB, saving much
space over having multiple, smaller mappings).
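A minimal sketch of why a static index/tag split breaks down with mixed page sizes. It assumes a hypothetical 64-set TLB and 4 KB and 4 MB page sizes; none of these numbers come from the post, and `SETS`, `set_index`, and the addresses are illustrative.

```python
SETS = 64          # hypothetical number of sets
SMALL_SHIFT = 12   # 4 KB page
LARGE_SHIFT = 22   # 4 MB page (assumed large-page size)

def set_index(va: int, page_shift: int) -> int:
    """Index the TLB with the low virtual-page-number bits for a page size."""
    return (va >> page_shift) % SETS

# Two addresses inside the same 4 MB page, 8 KB apart.
a = 0x4000_0000
b = a + 0x2000

same_large_page = (a >> LARGE_SHIFT) == (b >> LARGE_SHIFT)
idx_small = (set_index(a, SMALL_SHIFT), set_index(b, SMALL_SHIFT))
idx_large = (set_index(a, LARGE_SHIFT), set_index(b, LARGE_SHIFT))

print(same_large_page, idx_small, idx_large)  # True (0, 2) (0, 0)
```

Indexing with the 4 KB bits sends the two addresses to sets 0 and 2, so one entry for the 4 MB page cannot serve both lookups. Indexing with the 4 MB bits finds it in one place, but then every 4 KB page inside the same 4 MB region competes for a single set.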
You could do something like sectoring (as is done in caches), but this
doesn't give you much dynamic range, and it would mostly defeat the purpose
of using only a single TLB entry to map a large address range.
With a CAM implementation of a TLB, each entry holds the virtual page number
(which may be variable in the number of bits) and the page size, which
informs the entry-local match logic as to how many of the incoming VA bits
need to match in order to signal a hit for this entry.
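The entry-local match described above can be sketched in software as follows. This is only a model under assumed names (`CamEntry`, `cam_lookup`) and assumed page sizes of 4 KB and 4 MB; real hardware performs all the compares in parallel rather than in a loop.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class CamEntry:
    vpn: int         # virtual page number, i.e. VA >> page_shift
    page_shift: int  # log2(page size) for this entry
    pfn: int         # physical frame number, in units of this page size

def cam_lookup(entries: List[CamEntry], va: int) -> Optional[int]:
    """Return the physical address for va, or None on a TLB miss."""
    for e in entries:  # hardware does these compares in parallel
        # Each entry ignores its own page-offset bits when matching.
        if (va >> e.page_shift) == e.vpn:
            offset = va & ((1 << e.page_shift) - 1)
            return (e.pfn << e.page_shift) | offset
    return None

tlb = [
    CamEntry(vpn=0x12345, page_shift=12, pfn=0xABC),  # 4 KB page
    CamEntry(vpn=0x100, page_shift=22, pfn=0x3),      # 4 MB page
]

print(hex(cam_lookup(tlb, 0x1234_5678)))  # 0xabc678
print(hex(cam_lookup(tlb, 0x4012_3456)))  # 0xd23456
print(cam_lookup(tlb, 0x5000_0000))       # None
```

The single 4 MB entry covers the whole range 0x40000000 to 0x403FFFFF, which is the space saving the post refers to.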
| Quote: | ps - i teach a CMOS vlsi design class at UT Austin, and we're covering
RAMs, which led to this question
|
If you need a concrete example of an architecture with variable page sizes,
might I suggest Itanium ;-)
Hope this was helpful to you.
- Dale Morris
Itanium Processor Architect
Hewlett-Packard Co. |
|
robertwessel2@yahoo.com, Guest | Posted: Thu Nov 10, 2005 1:15 am | Post subject: Re: TLB implementation: CAM vs SRAM
adnan.aziz@gmail.com wrote:
| Quote: | i was curious as to why TLBs are implemented using CAMs.
specifically, it seems to me that the organization used in caches
(e.g., 4-way set associative) would work as well for a TLB as for a
D-cache - you can make much larger SRAMs than CAMs.
i'm guessing it has something to do with the fact that the page table
already gets the benefit of the cache, but it's not obvious. (maybe a
small CAM is faster than the cache?)
|
It can be (and is) done both ways. The z990 processor, for example, uses
a two-level TLB: the first level has two 512-entry TLBs (one in the
instruction path, the other in the data path), and the second level has
4.5K entries. All of the TLBs are four-way set associative. FWIW,
these extremely large TLB sizes are largely driven by the amount of
partitioning and virtualization typically done in mainframe
environments. |
|
Guest | Posted: Thu Nov 10, 2005 1:15 am | Post subject: Re: TLB implementation: CAM vs SRAM
Iain McClatchie wrote:
| Quote: | I don't really know the answer, but I do know that
- you need at least as much associativity in the TLB as in the cache,
usually more.
- you have a lot fewer entries in the TLB. That makes the number of
sets very small, which makes short bitlines, which gets the SRAM
off the sweet spot.
|
I might note that Opteron (and Athlon) use both CAM and RAM
technologies to implement the TLBs. The small-ish CAM alleviates
pressure when associativity is more important than capacity, while the
RAM improves performance when capacity is more important than
associativity. This combination outperforms the same area using
RAM or CAM technology alone. |
|
Guest | Posted: Sat Nov 12, 2005 12:44 am | Post subject: Re: TLB implementation: CAM vs SRAM
Dale Morris wrote:
[snip]
| Quote: | One reason is variable page sizes (for architectures that support them).
Caches typically have a fixed line size, and so the division between index
and tag for a set-associative cache can be done independent of the data
being accessed. But, for TLBs that can hold different page sizes, no static
division of address bits between index and tag gives you the optimal design
(where a mapping for a large page appears just once in the TLB, saving much
space over having multiple, smaller mappings).
You could do something like sectoring (as is done in caches), but this
doesn't give you much dynamic range, and it would mostly defeat the purpose
of using only a single TLB entry to map a large address range.
|
Note also "Concurrent Support of Multiple Page Sizes On a
Skewed Associative TLB" (Andre Seznec). Since this proposal
uses two ways of (skewed) associativity per page size (IIRC,
this results in a hit rate better than a 4-way regular associative
TLB), it is less appropriate for supporting a very broad range of
page sizes (as in IPF). One could, though, use prediction (e.g.,
one page size or page-size set predictor per architected
address source register) to reduce the number of ways read
and so support a larger number of page sizes. Similarly, one
could use early way cancellation to reduce power consumption,
canceling the read and compare for a way if it does not
contain any pages of the appropriate size. For an L2 TLB,
using reindexing to check for a larger, less common page
size might be acceptable if the larger pages are supported
by the L1 TLB.
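A minimal sketch of the reindexing idea for a set-associative L2 TLB: probe with the index for the common small page size first and, on a miss, probe again with the index for the larger size. This is not Seznec's skewed scheme. The 64 sets, the 4 KB and 4 MB sizes, and the names `make_tlb`, `insert`, and `lookup` are all assumptions, and per-set capacity limits and replacement are left out for brevity.

```python
from typing import Dict, List, Optional, Tuple

SETS = 64               # hypothetical number of sets
PAGE_SHIFTS = (12, 22)  # probe order: 4 KB first, then 4 MB (assumed sizes)

# Each set maps (vpn, page_shift) -> pfn.
Tlb = List[Dict[Tuple[int, int], int]]

def make_tlb() -> Tlb:
    return [dict() for _ in range(SETS)]

def insert(tlb: Tlb, va: int, page_shift: int, pfn: int) -> None:
    """Install a mapping, indexed by the VPN bits for its own page size."""
    vpn = va >> page_shift
    tlb[vpn % SETS][(vpn, page_shift)] = pfn

def lookup(tlb: Tlb, va: int) -> Tuple[Optional[int], int]:
    """Return (physical address or None, number of probes used)."""
    for probes, shift in enumerate(PAGE_SHIFTS, start=1):
        vpn = va >> shift
        pfn = tlb[vpn % SETS].get((vpn, shift))
        if pfn is not None:
            offset = va & ((1 << shift) - 1)
            return (pfn << shift) | offset, probes
    return None, len(PAGE_SHIFTS)

tlb = make_tlb()
insert(tlb, 0x1234_5000, 12, 0xABC)  # 4 KB mapping
insert(tlb, 0x4000_0000, 22, 0x3)    # 4 MB mapping

small_hit = lookup(tlb, 0x1234_5678)  # found on the first probe
large_hit = lookup(tlb, 0x4012_3456)  # found only after reindexing
miss = lookup(tlb, 0x5000_0000)
```

The second probe is the latency cost of reindexing, which is why it is more plausible at the L2 TLB than at the L1.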
For a huge page size with multi-level or linear page tables, one
can also use separate tables for the huge pages and use the
same table to cache page directory entries (or the virtual address
range used for mapping the linear page table). This would
exploit the similarity in tag size and area coverage, allowing the
architect to reserve some area for a huge page TLB without
fully wasting that area if no huge pages are used. (One might
then consider using the other TLB as a victim cache for
PDEs/huge page PTEs; if few small pages are used relative to
PDEs [very sparse addressing] or huge pages, then less
recently used huge pages/PDEs would tend to survive in the
small page TLB at the latency cost of a reindexing; if many
small pages are highly active relative to huge pages/PDEs,
the huge page/PDE victims would tend to get evicted relatively
quickly for the size of the TLB.)
[snip]
| Quote: | If you need a concrete example of an architecture with variable page sizes,
might I suggest Itanium ;-)
|
IPF is somewhat exceptional in that it supports a large number
of page sizes and is architecturally required to have a
modest-sized fully associative TLB that supports all page sizes. (Also,
its use of hashed page tables, particularly for fully flexible page
size support, makes the idea in my second paragraph far less
appropriate.)
Mitch Alsup's response that associativity is more important
than capacity at small sizes is a more appropriate explanation
for the use of CAMs in L1 TLBs in x86 generally and Itanium
(in which the L1 TLB only supports 4KiB pages). (Presumably
L1 TLBs are limited by latency [and layout?] issues more than
die area?)
Paul A. Clayton
just a technophile |
|
Guest | Posted: Thu Nov 17, 2005 12:09 am | Post subject: Re: TLB implementation: CAM vs SRAM
dear dale,
thanks for your answer - i was surprised at how sophisticated the
analysis was, my students have enjoyed the debate,
cheers,
adnan
|
Dale Morris, Guest | Posted: Tue Nov 22, 2005 1:15 am | Post subject: Re: TLB implementation: CAM vs SRAM
Dear Adnan,
You are very welcome. It can be quite fascinating to study the interplay of
functional, performance and HW-implementation tradeoffs. Makes for a fun
job.
- Dale Morris
Itanium Processor Architect
Hewlett Packard |