Pentium M to become THE CPU
 
Andi Kleen
Posted: Wed Oct 12, 2005 4:15 pm    Post subject: Re: Pentium M to become THE CPU

MitchAlsup@aol.com writes:

[quote]Andi Kleen wrote:
First, the memory controller, integrated or not, is a bottleneck
for any CPU given a sufficiently fast workload. That's simply
because the DIMMs cannot keep up with the CPU.

I would say that in practice, for a normal desktop machine or a
laptop, the limit is how much bandwidth two DIMMs can deliver.

90%+ of the time, the problem is NOT bandwidth, but latency. The on-die
memory controller gets rid of all of the FSB (latency-adding) cycles.
[/quote]
No argument that latency is important, and the IMC wins on latency.

In practice a good single-CPU P4 system with an 800MHz FSB and a good
memory controller has about twice as much memory latency as an A64
(~90ns vs ~45ns[1]).

I suspect that if Intel cranks the FSB frequency up to 1GHz or more,
and possibly raises the frequency of their chipsets, they can get that
down to, let's say, only a 30-40% penalty, which they might make up
with other tricks like more cache. If they surpass AMD in faster DRAM
support as usual, they would also have comparable or better bandwidth.
So it doesn't look too bad in the near term, at least for a single
socket/dual core system.

[1] Actually, with lmbench a newer Intel dual-core system reports a lower
memory latency for me than an A64 does, but I suspect their prefetch
algorithms became so good they broke lmbench ;-)

[quote]In Opteron, for example, the address associated with an L2 miss can
arrive at the memory controller in less than 2ns, and data arriving at
the pins from the DIMMs can arrive back at the CPU in a similar number.
[/quote]
... if you don't have to wait for the cache probe responses of
the other CPUs.

[quote]In addition the on-die approach with the HyperTransport fabric
interconnect gives you the property that as you add CPUs, you also add
DRAM bandwidth and bisection bandwidth. A 4 Node Opteron system has ~4
times as much DRAM bandwidth as a 4 node Pentium (single) FSB system
and plenty of chip-to-chip bandwidth to route the data to where it is
needed.
[/quote]
Yes, for multi socket systems the Opteron NUMA setup is clearly a winner
right now.

-Andi (partly playing devil's advocate here)

Oliver S.
Posted: Wed Oct 12, 2005 4:15 pm

[quote]No argument that latency is important, and the IMC wins on latency.

In practice a good single-CPU P4 system with an 800MHz FSB and a good
memory controller has about twice as much memory latency as an A64
(~90ns vs ~45ns[1]).
[/quote]
And that's not mainly because of the FSB, but the double cache-line
length of the P4!

[quote][1] Actually, with lmbench a newer Intel dual-core system reports a
lower memory latency for me than an A64 does, but I suspect their prefetch
algorithms became so good they broke lmbench ;-)
[/quote]
How could a prefetching algorithm break a memory-latency benchmark?
Latency demands are strongest when doing pointer-chasing, which
defeats out-of-order execution, so even hardware scouting won't help!
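
To make the pointer-chasing argument concrete, here is a minimal sketch in Python. It is an illustration only, not lmbench's actual code, and `single_cycle` and `chase` are made-up names. It builds a random permutation that forms one big cycle, so the address of every load depends on the result of the previous one and there is no fixed stride for a prefetcher to learn. Python is far too slow to measure DRAM latency, so this shows only the access pattern, not a usable benchmark.

```python
import random


def single_cycle(n, seed=0):
    """Return a list nxt where following i -> nxt[i] visits all n slots
    in one cycle (Sattolo's algorithm)."""
    rng = random.Random(seed)
    nxt = list(range(n))
    for i in range(n - 1, 0, -1):
        j = rng.randrange(i)  # j < i, never i itself: this forces a single cycle
        nxt[i], nxt[j] = nxt[j], nxt[i]
    return nxt


def chase(nxt, steps, start=0):
    """Dependent loads: the next index is only known once the current load returns."""
    p = start
    for _ in range(steps):
        p = nxt[p]
    return p
```

A real latency test would time `chase` over a buffer much larger than the last-level cache and divide by the number of steps; whether a stride or adjacent-line prefetcher can still help then depends on how random the chain really is.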

Seongbae Park
Posted: Wed Oct 12, 2005 4:44 pm

Jens Meyer <Jens.Meyer@quantentunnel.de> wrote:
[quote]You forgot to consider a major latency-factor: the cacheline-size. The
P4 has a stupid cacheline-size of 128 bytes (16 times the bus-width!)
in the L2- and L3-caches, whereas the P3, the Pentium-M and all Athlons
have a more reasonable cacheline-size of 64 bytes on all cache-levels.
[/quote]
An interesting claim.

Some, if not most, variants of recent Pentium 4 (Smithfield?)
have an L2 cache with a 64-byte line size.

That said, I wouldn't call a 128-byte L2 line "stupid".
It depends on the workload, the cache size and memory bandwidth among others.
With a 1MB or 2MB L2, I think there are enough L2 lines (2M/128 = 16k),
although it would be a problem if the L2 were small (e.g. a 128K L2
would mean a measly 1k L2 lines if the line size is 128 bytes).
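
The line counts above are just cache size divided by line size. A quick sketch of that arithmetic (the helper name `l2_lines` is mine, and the sizes are the ones mentioned in this thread):

```python
def l2_lines(cache_bytes, line_bytes):
    """Number of lines a cache of the given size holds."""
    return cache_bytes // line_bytes


KIB = 1024
MIB = 1024 * KIB

for cache_bytes in (128 * KIB, 1 * MIB, 2 * MIB):
    for line_bytes in (64, 128):
        print(cache_bytes // KIB, "KiB cache,", line_bytes, "B lines:",
              l2_lines(cache_bytes, line_bytes), "lines")
```

With 128-byte lines, a 2MB L2 holds 16384 lines and a 128K L2 holds only 1024.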
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"

Andi Kleen
Posted: Wed Oct 12, 2005 9:06 pm

Jens Meyer <Jens.Meyer@quantentunnel.de> writes:

[quote]You forgot to consider a major latency-factor: the cacheline-size. The
P4 has a stupid cacheline-size of 128 bytes (16 times the bus-width!)
in the L2- and L3-caches, whereas the P3, the Pentium-M and all Athlons
have a more reasonable cacheline-size of 64 bytes on all cache-levels.
[/quote]
A modern bus should do critical word first, so I wouldn't expect
this to be a large disadvantage (given they have enough bandwidth,
which they will probably have on single-socket systems).
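
A rough sketch of what critical word first means for the transfer order. The assumptions are mine: an 8-byte data beat (the "16 times the bus-width" figure from the quote), a 128-byte line, and a simple wrap-around order. Real buses may use a different burst order, and the function name is made up.

```python
def critical_word_first(miss_addr, line_bytes=128, beat_bytes=8):
    """Addresses of the bus beats in transfer order, starting with the beat
    that holds the requested address and wrapping around the line."""
    beats = line_bytes // beat_bytes
    line_base = miss_addr - (miss_addr % line_bytes)
    first = (miss_addr % line_bytes) // beat_bytes
    return [line_base + ((first + k) % beats) * beat_bytes for k in range(beats)]


order = critical_word_first(0x1048)
print([hex(a) for a in order[:3]])  # the requested beat comes first
```

The core can restart as soon as the first beat arrives, so the longer line mostly costs bus occupancy rather than load-to-use latency, which is why the bandwidth caveat matters.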

-Andi

Seongbae Park
Posted: Wed Oct 12, 2005 9:18 pm

Oliver S. <Follow.Me@gmx.net> wrote:
[quote]Some, if not most, variants of recent Pentium 4 (Smithfield?)
have an L2 cache with a 64-byte line size.

No! Only the L1-cachelines are 64 bytes on P4-incarnations; the
cachelines in all other cache-levels are 128 bytes long! Each
L2-cacheline has two flags that indicate whether its halves are
dirty, so that the cache can write back only half a cacheline.
[/quote]
Ah. Apparently the document I looked at was wrong.
Intel's website makes it clear that it's a 128B line with two subblocks.

[quote]That said, I wouldn't call a 128-byte L2 line "stupid".

A longer cacheline only makes sense when the time taken to load a cacheline
is too short to issue the next load on the address-bus; SDRAM-technologies
allow overlapping the read/write-requests with the bursts. If the
cacheline-loads take longer than the requests, you can reach the maximum
throughput without further increasing the cacheline-size.

It depends on the workload, the cache size and memory bandwidth among others.

No, it depends only on the above consideration.
[/quote]
I don't agree that a sensible L2 line size can be determined by
a single factor. Given a fixed bus width, you're right that a bigger
L2 line size may not give you any better throughput. But that isn't
the only consideration when you design a chip. There's always some
trade-off between associativity, cache size, line size, subblocking,
die size, clock frequency, etc. A smaller line size needs more
outstanding L2 misses to hit the same bandwidth, and that has a cost
as well. Also, the cache hit/miss rate due to spatial locality changes
with the L2 line size, and the amount of change of course depends on
the workload characteristics.
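
The point about outstanding misses follows from Little's law: bytes in flight = bandwidth x latency, and each outstanding miss covers one line. A back-of-the-envelope sketch, where the 6.4 GB/s figure (peak of dual-channel DDR400) and the 90ns latency are only illustrative inputs and the function name is mine:

```python
def misses_in_flight(bandwidth_bytes_per_s, latency_ns, line_bytes):
    """Outstanding line fills needed to sustain the given bandwidth
    (Little's law), rounded up to a whole number of misses."""
    bytes_in_flight_scaled = bandwidth_bytes_per_s * latency_ns  # scaled by 1e9
    return -(-bytes_in_flight_scaled // (line_bytes * 10**9))    # ceiling division


for line_bytes in (64, 128):
    print(line_bytes, "B lines:",
          misses_in_flight(6_400_000_000, 90, line_bytes), "misses in flight")
```

Under these assumptions 64-byte lines need 9 misses in flight and 128-byte lines need 5, so halving the line size roughly doubles the miss buffers the core has to track.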
--
#pragma ident "Seongbae Park, compiler, http://blogs.sun.com/seongbae/"

Andi Kleen
Posted: Wed Oct 12, 2005 9:47 pm

Seongbae Park <Seongbae.Park@Sun.COM> writes:
[quote]
Some, if not most, variants of recent Pentium 4 (Smithfield?)
have an L2 cache with a 64-byte line size.
[/quote]
It only does bus IO in 128 byte chunks, so the effective one is 128 bytes.

[quote]It depends on the workload, the cache size and memory bandwidth among others.
With a 1MB or 2MB L2, I think there are enough L2 lines (2M/128 = 16k),
although it would be a problem if the L2 were small (e.g. a 128K L2
would mean a measly 1k L2 lines if the line size is 128 bytes).
[/quote]
You described a Celeron (OK, they come with 256K now).

-Andi

Oliver S.
Posted: Wed Oct 12, 2005 9:56 pm

[quote]Some, if not most, variants of recent Pentium 4 (Smithfield?)
have an L2 cache with a 64-byte line size.
[/quote]
No! Only the L1-cachelines are 64 bytes on P4-incarnations; the
cachelines in all other cache-levels are 128 bytes long! Each
L2-cacheline has two flags that indicate whether its halves are
dirty, so that the cache can write back only half a cacheline.

[quote]That said, I wouldn't call a 128-byte L2 line "stupid".
[/quote]
A longer cacheline only makes sense when the time taken to load a cacheline
is too short to issue the next load on the address-bus; SDRAM-technologies
allow overlapping the read/write-requests with the bursts. If the
cacheline-loads take longer than the requests, you can reach the maximum
throughput without further increasing the cacheline-size.

[quote]It depends on the workload, the cache size and memory bandwidth among others.
[/quote]
No, it depends only on the above consideration.

Niels Jørgen Kruse
Posted: Wed Oct 12, 2005 10:45 pm

Andi Kleen <freitag@alancoxonachip.com> wrote:

[quote]Seongbae Park <Seongbae.Park@Sun.COM> writes:

Some, if not most, variants of recent Pentium 4 (Smithfield?)
have an L2 cache with a 64-byte line size.

It only does bus IO in 128 byte chunks, so the effective one is 128 bytes.
[/quote]
You have the prospect of being able to age out the half of the 128 byte
chunk that is unused.

--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark

Scott Alfter
Posted: Wed Oct 12, 2005 10:46 pm

In article <DK93f.15769$6e1.7041@newssvr14.news.prodigy.com>,
Kelly Hall <khall@acm.org> wrote:
[quote]Nathan Bates wrote:
Pentium M has all the right ingredients for total world domination:
low power consumption, short pipeline stages, hi-performance.

I'm still banking on the 8051 - that thing just won't go down.
[/quote]
8051? The 6502 pwnz j00. :-)

(One of these days, I'll build a controller for my beer fridges so I can
free up the Apple IIs that are currently running them (a IIGS on one and a
IIe on the other). To simplify the software-porting effort, it'll most
likely be built around a 6502, or something compatible with it. It's not
like monitoring the temperature and switching the compressor on and off
requires dual Opterons or something insane like that.)

_/_
/ v \ Scott Alfter (remove the obvious to send mail)
(IIGS( http://alfter.us/ Top-posting!
\_^_/ rm -rf /bin/laden >What's the most annoying thing on Usenet?


Andi Kleen
Posted: Wed Oct 12, 2005 11:31 pm

nospam@ab-katrinedal.dk (Niels Jørgen Kruse) writes:

[quote]Andi Kleen <freitag@alancoxonachip.com> wrote:

Seongbae Park <Seongbae.Park@Sun.COM> writes:

Some, if not most, variants of recent Pentium 4 (Smithfield?)
have an L2 cache with a 64-byte line size.

It only does bus IO in 128 byte chunks, so the effective one is 128 bytes.

You have the prospect of being able to age out the half of the 128 byte
chunk that is unused.
[/quote]
And what would you do with the empty half if you cannot refill it?

-Andi

Guest
Posted: Wed Oct 12, 2005 11:33 pm

Oliver S. wrote:
[quote]A 4 Node Opteron system has ~4 times as much DRAM bandwidth as a 4
node Pentium (single) FSB system and plenty of chip-to-chip bandwidth
to route the data to where it is needed.

It has about four times the store-bandwidth - but not the load-bandwidth
due to speculative snoops.
[/quote]
The 4 processors in a 4-node fabric system can make 4 read accesses (L2
misses) to 4 memory controllers. Say one isolated access takes 120ns with
Probe delays: all 4 of these accesses can complete in only 130ns when
each request is serviced by a different memory controller! Including
Probe delays! In addition, a single CPU can make 4 overlapped requests
to 4 different memory controllers and get all of its data and
probes back in only 130ns.

And while all of the above is going on, other processors can be using
the remaining fabric link bandwidth to transmit store data to the
various memory controllers...

Mitch
<Note: memory delays are illustrative and vary from system to system>
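
For comparison, here is the same arithmetic as a small sketch. It uses only the illustrative 120ns and 130ns figures above; the assumption that a single shared path would fully serialize the four accesses is mine, as are the function names.

```python
def serialized_ns(accesses, isolated_ns):
    """Total time if each access must wait for the previous one to finish."""
    return accesses * isolated_ns


def speedup(accesses, isolated_ns, overlapped_ns):
    """How much faster the overlapped case finishes than the serialized one."""
    return serialized_ns(accesses, isolated_ns) / overlapped_ns


print(serialized_ns(4, 120), "ns serialized vs 130 ns overlapped")
print(round(speedup(4, 120, 130), 2), "x")
```

Under these assumptions, four accesses that would take 480ns back to back finish in 130ns when spread over four controllers, which is about 3.7 times faster.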

Oliver S.
Posted: Thu Oct 13, 2005 12:15 am

[quote]And what would you do with the empty half if you cannot refill it?
[/quote]
A half is never discarded in favor of other data from the higher
cache-levels. The flags are only there to enable the CPU to perform
a partial write-back of a cacheline when it is completely evicted
from the cache-level.
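
As a toy model of that behaviour (a sketch of the idea only, not Intel's actual implementation; the class and method names are invented): a 128-byte line carries one dirty flag per 64-byte half, a store marks the half it touches, and eviction writes back only the dirty halves.

```python
from dataclasses import dataclass, field

HALF_BYTES = 64  # each 128-byte line is tracked as two 64-byte halves


@dataclass
class SectoredLine:
    base: int  # address of the first byte of the 128-byte line
    dirty: list = field(default_factory=lambda: [False, False])

    def write(self, addr):
        """A store marks only the half that contains addr as dirty."""
        self.dirty[(addr - self.base) // HALF_BYTES] = True

    def evict(self):
        """On eviction of the whole line, return (address, size) of each
        half that actually has to be written back."""
        return [(self.base + i * HALF_BYTES, HALF_BYTES)
                for i, is_dirty in enumerate(self.dirty) if is_dirty]


line = SectoredLine(base=0x1000)
line.write(0x1050)   # touches the upper half only
print(line.evict())  # one 64-byte write-back instead of 128 bytes
```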

Oliver S.
Posted: Thu Oct 13, 2005 12:15 am

[quote]A longer cacheline only makes sense when the time taken to load a cacheline
is too short to issue the next load on the address-bus; SDRAM-technologies
allow overlapping the read/write-requests with the bursts. If the
cacheline-loads take longer than the requests, you can reach the maximum
throughput without further increasing the cacheline-size.

It depends on the workload, the cache size and memory bandwidth among others.

No, it depends only on the above consideration.

There's always some trade-off between associativity,
cache size, line size, subblocking, ...
[/quote]
Of course, but L2- and L3-caches aren't as constrained in those
respects as an L1-cache.

[quote]A smaller line size needs more outstanding L2 misses
to hit the same bandwidth, and that has a cost as well.
[/quote]
With an OoO-core it's quite easy to sustain a high number of outstanding
parallel load-requests; e.g. load-requests that consecutively access a
cacheline and the following cacheline in the same SDRAM-page, thereby
seamlessly doing two adjacent bursts. And when you don't have accesses
to consecutive memory-addresses, the shorter cachelines help to lower
the latency.

Oliver S.
Posted: Thu Oct 13, 2005 12:15 am

[quote]Pentium M has all the right ingredients for total world domination:
low power consumption, short pipeline stages, hi-performance.

Are they x86-64?
[/quote]
Even the upcoming dual-core incarnation of the Pentium-M won't
be capable of x86-64, and it seems that this version will even lack
VT; but neither matters much for notebook systems.

Oliver S.
Posted: Thu Oct 13, 2005 12:15 am

[quote]The 4 processors in a 4-Node fabric system can make 4 read
accesses (L2 miss) to 4 memory controllers.
[/quote]
OK, you're right here. But feeding threads with physical memory-pages from
the current CPU's memory-controller is rather complicated when threads can
migrate between cores and memory-pages are shared between cores. To handle
this efficiently, each page-table-entry (PTE) would need counters that
count the read- and write-accesses of the CPU currently using that
page-table; the OS could then use them to migrate logical pages from
physical pages of one CPU to physical pages of another to maximize
performance. Unfortunately, AMD did not add this feature to the AMD64
page-tables, so you usually have a lot of interconnect-traffic on loads.
For high-performance applications you would therefore have to pin a thread
to a certain CPU and use an API that gives you memory-pages that physically
map to the memory attached to that CPU. But I'm not aware of any OS that
supports such processor-affine allocations. And with further improved PTEs
carrying a flag that suppresses the snoop-broadcasts, it would be possible
to drop the snoop-broadcasts. Unfortunately, AMD64 doesn't include this
feature either.
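
A sketch of how an OS might use such counters if they existed. Everything here is hypothetical: the per-node access counts stand in for the PTE counters described above (which AMD64 does not provide), and the function names and the 2x threshold are arbitrary choices for illustration.

```python
def preferred_node(access_counts):
    """Index of the node that touched the page most often."""
    return max(range(len(access_counts)), key=lambda node: access_counts[node])


def migration_target(home_node, access_counts, min_ratio=2):
    """Return the node the page should move to, or None to leave it alone.
    The page moves only if another node uses it at least min_ratio times
    as often as its current home node (hypothetical policy)."""
    best = preferred_node(access_counts)
    if best == home_node:
        return None
    if access_counts[best] >= min_ratio * max(access_counts[home_node], 1):
        return best
    return None


# Page lives on node 0 but is mostly read by node 1: move it.
print(migration_target(0, [10, 50, 0, 0]))
```

The threshold is there to avoid ping-ponging a page that two nodes share about equally, where migrating it would only shift the interconnect traffic to the other side.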
 
All times are GMT