| Author |
Message |
Jeff Anderson-Lee
Guest
|
Posted:
Thu Jul 07, 2005 10:50 pm Post subject:
stalling the TSC? |
|
|
I'm seeing a curious phenomenon on some machines I am benchmarking.
One is an AMD Opteron, one is an Intel Xeon, and one is a VIA EDEN
processor. In all cases, using rdtsc returns a different number
of timestamps per microsecond depending on the code that is running.
A simple busy loop will typically return the maximal rate, but if I run
code that exercises both the CPU and memory a lot, the number of ticks
per microsecond slows down as if something is stalling the timestamp
counter. This runs contrary to what I would expect from a (useful)
timestamp counter.
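For readers following along, the "ticks per microsecond" figure can be sketched as below. This is a minimal illustration, not the benchmark actually used in this post: the function name and the sample readings are hypothetical, and the counter values are assumed to have been captured elsewhere (for example with rdtsc) alongside a wall-clock reading in nanoseconds.

```python
def ticks_per_microsecond(tsc_start, tsc_end, wall_start_ns, wall_end_ns):
    """Average counter ticks per microsecond of wall-clock time."""
    elapsed_ns = wall_end_ns - wall_start_ns
    if elapsed_ns <= 0:
        raise ValueError("wall-clock interval must be positive")
    # ticks / (ns / 1000) == ticks * 1000 / ns
    return (tsc_end - tsc_start) * 1000 / elapsed_ns


# Hypothetical readings: 2,000,000 ticks over 1 ms of wall time.
busy_loop = ticks_per_microsecond(0, 2_000_000, 0, 1_000_000)
# Hypothetical readings: only 1,500,000 ticks over the same 1 ms.
memory_heavy = ticks_per_microsecond(0, 1_500_000, 0, 1_000_000)
print(busy_loop, memory_heavy)
```

If the two rates differ for the same wall-clock interval, the counter is not advancing at a fixed rate relative to real time, which is the symptom described here.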
Do the processors stall the TSC when they stall the pipeline?
Sign me puzzled.
Jeff Anderson-Lee
System Manager, PSI Project
ERSO, UC Berkeley |
|
|
 |
Colin Andrew Percival
Guest
|
Posted:
Thu Jul 07, 2005 10:53 pm Post subject:
Re: stalling the TSC? |
|
|
Jeff Anderson-Lee <jonah@dlp.cs.berkeley.edu> wrote:
| Quote: | A simple busy loop will typically return the maximal rate, but if I run
code that exercises both the CPU and memory a lot, the number of ticks
per microsecond slows down as if something is stalling the timestamp
counter. This runs contrary to what I would expect from a (useful)
timestamp counter.
|
The TSC depends upon the clock frequency, and the clock frequency can be
reduced if the processor is overheating. That's where I'd look first.
Colin Percival |
|
|
 |
Andi Kleen
Guest
|
Posted:
Fri Jul 08, 2005 12:15 am Post subject:
Re: stalling the TSC? |
|
|
jonah@dlp.CS.Berkeley.EDU (Jeff Anderson-Lee) writes:
| Quote: | I'm seeing a curious phenomenon on some machines I am benchmarking.
One is an AMD Opteron, one is an Intel Xeon, and one is a VIA EDEN
processor. In all cases, using rdtsc returns a different number
of timestamps per microsecond depending on the code that is running.
A simple busy loop will typically return the maximal rate, but if I run
code that exercises both the CPU and memory a lot, the number of ticks
per microsecond slows down as if something is stalling the timestamp
counter. This runs contrary to what I would expect from a (useful)
timestamp counter.
Do the processors stall the TSC when they stall the pipeline?
|
VIA Eden stops TSC during HLT. This means when your benchmark
has some idle periods you will see it.
Opteron changes TSC frequency when powernow is active - the OS
can throttle the CPU when it is under low load then the TSC
runs slower.
This also implies it is not consistent between different CPUs
in a multi socket system.
On the Xeon if it's a modern one with speedstep TSC is constant frequency,
but the CPU can also run with a lower frequency. If that's the
case then the TSC will suddenly run faster compared to the
pipeline.
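The three behaviours described above can be compared with a toy model. This is only a sketch under simplifying assumptions: the model names (`stops_in_hlt`, `follows_core_clock`, `constant_rate`), the `Segment` type and the sample timeline are made up for illustration and do not describe the actual silicon in detail.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    duration_us: int  # wall-clock length of the segment
    core_mhz: int     # core clock during the segment
    halted: bool      # True if the CPU sat in HLT for the whole segment


def tsc_ticks(segments, model, nominal_mhz):
    """Ticks accumulated over the timeline under one simplified TSC model."""
    total = 0
    for seg in segments:
        if model == "stops_in_hlt":
            # Counter freezes while halted, else runs at the nominal rate.
            total += 0 if seg.halted else seg.duration_us * nominal_mhz
        elif model == "follows_core_clock":
            # Counter tracks whatever frequency the core is running at.
            total += seg.duration_us * seg.core_mhz
        elif model == "constant_rate":
            # Counter ignores both throttling and halting.
            total += seg.duration_us * nominal_mhz
        else:
            raise ValueError(f"unknown model: {model}")
    return total


# Hypothetical 3 ms timeline: full speed, throttled to half, then idle.
timeline = [
    Segment(1000, 2000, False),
    Segment(1000, 1000, False),
    Segment(1000, 2000, True),
]
for model in ("stops_in_hlt", "follows_core_clock", "constant_rate"):
    ticks = tsc_ticks(timeline, model, nominal_mhz=2000)
    print(model, ticks, ticks / 3000)  # total ticks, ticks per microsecond
```

Only the constant-rate model yields the same ticks-per-microsecond figure regardless of what the workload does.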
Hope this helps,
-Andi |
|
|
 |
glen herrmannsfeldt
Guest
|
Posted:
Fri Jul 08, 2005 12:15 am Post subject:
Re: stalling the TSC? |
|
|
Jeff Anderson-Lee wrote:
| Quote: | I'm seeing a curious phenomenon on some machines I am benchmarking.
One is an AMD Opteron, one is an Intel Xeon, and one is a VIA EDEN
processor. In all cases, using rdtsc returns a different number
of timestamps per microsecond depending on the code that is running.
|
Others have explained the variable clock rate.
I like to use RDTSC with the assumption that the number of clock
cycles won't change (too much) independent of the execution time.
I have even done it from Java. While there should be some
effects from task switching I haven't seen much on the
(small number) of tests that I have done.
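One common way to reduce the effect of task switches on such measurements is to repeat the run and keep the smallest count. The sketch below shows that idea; it is not the method used in this post. The helper name `min_elapsed` is hypothetical, and `time.perf_counter_ns` stands in for a cycle counter because plain Python cannot issue RDTSC directly.

```python
import time


def min_elapsed(fn, repeats=5, read_counter=time.perf_counter_ns):
    """Run fn several times and return the smallest counter delta seen."""
    best = None
    for _ in range(repeats):
        start = read_counter()
        fn()
        elapsed = read_counter() - start
        # A run disturbed by a task switch only inflates the delta,
        # so the minimum is the least-disturbed observation.
        if best is None or elapsed < best:
            best = elapsed
    return best


print(min_elapsed(lambda: sum(range(1000))))
```

The `read_counter` parameter is there so that a real cycle counter binding could be swapped in if one is available.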
-- glen |
|
|
 |
Nick Maclaren
Guest
|
Posted:
Fri Jul 08, 2005 8:15 am Post subject:
Re: stalling the TSC? |
|
|
Somewhat related to this, and a question that I have asked before,
can someone explain to me WHY modern systems make such a pig's
ear of timing?
It is trivial to provide much more accurate and consistent timers,
both locally and globally, for LESS complexity than the current
ones. Yet everybody seems to copy the broken designs that were
first used as a way to hack timing into some of the really gruesome
1960s timerless architectures and their 1980s microprocessor clones.
Harking back to this thread, one classic error is to define a timer
as measuring a type of time A (say clock cycles or a real-time count)
and actually implement it as a kludged-up type of time B (say the
other one). The ICL 1900 series did this, for example ....
But why? It is SO obviously stupid, because it never has worked.
Regards,
Nick Maclaren. |
|
|
 |
Jeff Anderson-Lee
Guest
|
Posted:
Fri Jul 08, 2005 3:35 pm Post subject:
Re: stalling the TSC? |
|
|
Casper H.S. Dik <Casper.Dik@Sun.COM> wrote:
| Quote: | Andi Kleen <freitag@alancoxonachip.com> writes:
On the Xeon if it's a modern one with speedstep TSC is constant frequency,
but the CPU can also run with a lower frequency. If that's the
case then the TSC will suddenly run faster compared to the
pipeline.
Oh? That's a nice property.
|
It would be nice if I could just get the code to run at maximum speed.
| Quote: | Is there a need for wall clock timing to GHz resolution?
|
Well, my programs ultimately produce their output in wallclock time
which is what I'm trying to optimize.
Jeff Anderson-Lee |
|
|
 |
glen herrmannsfeldt
Guest
|
Posted:
Fri Jul 08, 2005 4:12 pm Post subject:
Re: stalling the TSC? |
|
|
Nick Maclaren wrote:
| Quote: | Somewhat related to this, and a question that I have asked before,
can someone explain to me WHY modern systems make such a pig's
ear of timing?
It is trivial to provide much more accurate and consistent timers,
both locally and globally, for LESS complexity than the current
ones. Yet everybody seems to copy the broken designs that were
first used as a way to hack timing into some of the really gruesome
1960s timerless architectures and their 1980s microprocessor clones.
|
Well, to me they measure two different things. RDTSC is nice for
actually measuring clock cycles independent of wall clock time.
I can compare two code sequences without knowing the clock frequency.
(I don't do it very often, but sometimes...)
As far as implementation, I don't see anything simpler than cycle counting.
For a global high resolution clock you would need a separate oscillator,
most likely phase locked to some external reference. Maybe it wouldn't
be all that hard to do. I don't know if an on chip PLL could phase lock
a 1GHz clock to a 32768Hz reference, though.
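As a quick check on the numbers in that last sentence, the multiplication factor between the two frequencies can be worked out exactly. This is just arithmetic on the figures quoted above and says nothing about whether a given PLL design could achieve it.

```python
from fractions import Fraction

# Ratio between a 1 GHz target clock and a 32768 Hz reference crystal.
ratio = Fraction(1_000_000_000, 32_768)
print(ratio, float(ratio))
print(ratio.denominator == 1)  # False: the factor is not a whole number
```

Because the factor is not an integer, a multiplier that only supports whole-number ratios could not land exactly on 1 GHz from this reference.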
Is there a need for wall clock timing to GHz resolution?
-- glen |
|
|
 |
Andi Kleen
Guest
|
Posted:
Fri Jul 08, 2005 4:15 pm Post subject:
Re: stalling the TSC? |
|
|
nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
| Quote: | Somewhat related to this, and a question that I have asked before,
can someone explain to me WHY modern systems make such a pig's
ear of timing?
|
The PCs have external timers which avoid all that of course. They're
just much slower to read compared to the TSC (factor 5-10 for HPET
which not everybody has yet, much worse for ACPI/PIT) and less
accurate.
Then there is the CPU internal TSC which is faster, but has
other problems.
So basically the modern systems are too fast for efficient timing :)
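The cost of reading a clock can be estimated with a loop like the one below. It is a rough sketch only: the three Python clocks are stand-ins chosen for illustration and are not HPET, the ACPI timer or the TSC as such, and the interpreter's call overhead is included in every figure, so only the relative differences mean anything.

```python
import time


def read_cost_ns(clock, reads=100_000):
    """Average wall-clock cost, in nanoseconds, of one call to clock()."""
    start = time.perf_counter_ns()
    for _ in range(reads):
        clock()
    return (time.perf_counter_ns() - start) / reads


for name in ("perf_counter_ns", "monotonic_ns", "time_ns"):
    print(name, read_cost_ns(getattr(time, name)))
```

Which hardware timer backs each of these calls depends on the operating system and its configuration.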
| Quote: | It is trivial to provide much more accurate and consistent timers,
both locally and globally, for LESS complexity than the current
ones. Yet everybody seems to copy the broken designs that were
first used as a way to hack timing into some of the really gruesome
1960s timerless architectures and their 1980s microprocessor clones.
|
Look up HPET. It's pretty modern and mostly usable with only minor
warts. Unfortunately many x86 vendors don't enable it yet (even when
the chipset has it in theory) because the current versions of Windows
don't use it and they don't enable anything that's not used by
Windows. This leaves other timers which have various problems, but can
be still used with some performance penalty.
-Andi |
|
|
 |
Nick Maclaren
Guest
|
Posted:
Fri Jul 08, 2005 4:15 pm Post subject:
Re: stalling the TSC? |
|
|
In article <U4ednXCTMN_iwlPfRVn-gA@comcast.com>,
glen herrmannsfeldt <gah@ugcs.caltech.edu> writes:
|> Nick Maclaren wrote:
|>
|> > Somewhat related to this, and a question that I have asked before,
|> > can someone explain to me WHY modern systems make such a pig's
|> > ear of timing?
|>
|> > It is trivial to provide much more accurate and consistent timers,
|> > both locally and globally, for LESS complexity than the current
|> > ones. Yet everybody seems to copy the broken designs that were
|> > first used as a way to hack timing into some of the really gruesome
|> > 1960s timerless architectures and their 1980s microprocessor clones.
|>
|> Well, to me they measure two different things. RDTSC is nice for
|> actually measuring clock cycles independent of wall clock time.
|> I can compare two code sequences without knowing the clock frequency.
|> (I don't do it very often, but sometimes...)
You're missing my point. A cycle counter IS a timer - the fact
that it measures time by cycles rather than an approximation to
real-time seconds is irrelevant. Most "CPU timers" and related
measurements are actually tick counts not estimates of physical
time, too.
|> As far as implementation, I don't see anything simpler than
|> cycle counting.
On a very simple system, with a single, global, permanently running
clock, yes. It's not quite so simple with asynchronous or unclocked
logic, or even events that correspond to another clock entirely. It
becomes seriously complex when attempting to match timestamps between
separate CPUs.
|> For a global high resolution clock you would need a separate oscillator,
|> most likely phase locked to some external reference. Maybe it wouldn't
|> be all that hard to do. I don't know if an on chip PLL could phase lock
|> a 1GHz clock to a 32768Hz reference, though.
That is precisely NOT how to do it! I know that solution is taken
out of knee-jerk reflex by many hardware people, but that is merely
the result of poor teaching and lack of imagination. It doesn't
scale, can't be made to, and is generally a disaster in other
respects.
|> Is there a need for wall clock timing to GHz resolution?
That's not the right question. The current state is that there
is a major need for one with MHz consistency, and the typical
delivered consistency is mHz. It would be trivial to provide a
global clock for even quite large systems with GHz resolution
and local consistency, MHz global consistency and mHz accuracy.
And that would satisfy 99% of all known requirements.
As soon as you start to introduce non-trivial parallelism, you
start to hit problems that are best resolved by having a global
clock with a known resolution. The typical 'solution' of adding
global locks is sometimes necessary, but is always a disaster
when it isn't.
In particular, if you ever need to be able to sort packets from
other nodes into a globally consistent order (NOT necessarily
corresponding to the One True Physical Time), the availability of
a suitable global clock becomes critical. Yes, you can do it
without, but everything is vastly more complicated, less scalable
and slower.
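To illustrate the packet-ordering case: if every node stamps its packets from a shared clock, a globally consistent order falls out of a plain merge. The sketch below assumes each node's stream is already sorted by timestamp and breaks ties by node id. The tuple layout and the function name are hypothetical.

```python
import heapq


def merge_by_global_clock(*streams):
    """Merge per-node packet streams into one globally consistent order.

    Each stream is a list of (timestamp, node_id, payload) tuples that is
    already sorted by timestamp. Ties are broken by node_id, so every
    receiver that runs this merge arrives at the same order.
    """
    return list(heapq.merge(*streams, key=lambda pkt: (pkt[0], pkt[1])))


# Hypothetical packets from two nodes, stamped from the same global clock.
node_a = [(100, "a", "x1"), (300, "a", "x2")]
node_b = [(100, "b", "y1"), (200, "b", "y2")]
ordered = merge_by_global_clock(node_a, node_b)
print([payload for _, _, payload in ordered])
```

The order need not match physical time exactly. It only has to be the same everywhere, which is why the consistency of the clock matters more than its absolute accuracy.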
Regards,
Nick Maclaren. |
|
|
 |
Casper H.S. Dik
Guest
|
Posted:
Fri Jul 08, 2005 4:15 pm Post subject:
Re: stalling the TSC? |
|
|
Andi Kleen <freitag@alancoxonachip.com> writes:
| Quote: | VIA Eden stops TSC during HLT. This means when your benchmark
has some idle periods you will see it.
|
Ouch, that's not good. It's not supposed to do that.
| Quote: | Opteron changes TSC frequency when powernow is active - the OS
can throttle the CPU when it is under low load then the TSC
runs slower.
|
Similarly, all the Pentiums with SpeedStep will vary clock
frequency and TSC.
| Quote: | On the Xeon if it's a modern one with speedstep TSC is constant frequency,
but the CPU can also run with a lower frequency. If that's the
case then the TSC will suddenly run faster compared to the
pipeline.
|
Oh? That's a nice property.
Casper |
|
|
 |
Nick Maclaren
Guest
|
Posted:
Fri Jul 08, 2005 10:12 pm Post subject:
Re: stalling the TSC? |
|
|
In article <p73d5ptpcip.fsf@verdi.suse.de>,
Andi Kleen <freitag@alancoxonachip.com> wrote:
| Quote: | nmm1@cus.cam.ac.uk (Nick Maclaren) writes:
Somewhat related to this, and a question that I have asked before,
can someone explain to me WHY modern systems make such a pig's
ear of timing?
The PCs have external timers which avoid all that of course. They're
just much slower to read compared to the TSC (factor 5-10 for HPET
which not everybody has yet, much worse for ACPI/PIT) and less
accurate.
Then there is the CPU internal TSC which is faster, but has
other problems.
So basically the modern systems are too fast for efficient timing :)
|
I was including most PCs and external timers in "making a pig's ear"
of it. The problem is NOT that modern systems are too fast for
efficient timing, but that it is usually (perhaps always) done so
spectacularly incompetently.
For example, ANY performance penalty shows gross incompetence. Why
on earth does (hardware) timing need CPU cycles? At most, it should
need a few dozen bytes of bus bandwidth once every microsecond or
so, and 8 bytes when the time is read.
| Quote: | Look up HPET. It's pretty modern and mostly usable with only minor
warts. Unfortunately many x86 vendors don't enable it yet (even when
the chipset has it in theory) because the current versions of Windows
don't use it and they don't enable anything that's not used by
Windows. This leaves other timers which have various problems, but can
be still used with some performance penalty.
|
Thanks. I will look at it and see if anyone is learning anything
from experience, even if not history :-)
Regards,
Nick Maclaren. |
|
|
 |
Terje Mathisen
Guest
|
Posted:
Mon Jul 11, 2005 2:18 am Post subject:
Re: stalling the TSC? |
|
|
Nick Maclaren wrote:
| Quote: | In article <p73d5ptpcip.fsf@verdi.suse.de>,
Andi Kleen <freitag@alancoxonachip.com> wrote:
Look up HPET. It's pretty modern and mostly usable with only minor
warts. Unfortunately many x86 vendors don't enable it yet (even when
the chipset has it in theory) because the current versions of Windows
don't use it and they don't enable anything that's not used by
Windows. This leaves other timers which have various problems, but can
be still used with some performance penalty.
Thanks. I will look at it and see if anyone is learning anything
from experience, even if not history :-)
|
I believe I wrote a post about this, it still has several stupid warts,
some of which could have made it much more useful if avoided. :-(
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
|
 |
Colonel Forbin
Guest
|
Posted:
Tue Jul 12, 2005 12:15 am Post subject:
Re: stalling the TSC? |
|
|
A lot of this has to do with market pressure. "Timing" to most people is
how long it takes IE to start when you click on the icon.
With all the indirection created by various layers of abstraction and
virtualization in modern CPU designs, delivering accurate "cycle counts"
(whatever that means) is not a high priority. People who want an
accurate timebase use an external hardware reference clock. People
who want to optimize performance at the microinstruction level are
being driven to a quantum mechanical statistical view of the world
in many cases.
Many things today are done poorly simply because there is a sufficient
market for "just good enough" and nobody cares about doing better
regardless of whether it might cost the same or even less to implement.
Unfortunately this creates a lot of dilemmas for people who want to
do things we did in the past with hardware whose characteristics were
more accurately and reliably documented.
Maybe we're reaching the point where, as physicists did a century ago,
we may have to at least partially abandon a deterministic worldview,
regardless of whether the reasons are related to market forces or
nature. |
|
|
 |
Nick Maclaren
Guest
|
Posted:
Tue Jul 12, 2005 3:12 pm Post subject:
Re: stalling the TSC? |
|
|
In article <BYCAe.3830$B52.3024@tornado.ohiordc.rr.com>,
forbin@dev.nul (Colonel Forbin) writes:
|>
|> With all the indirection created by various layers of abstraction and
|> virtualization in modern CPU designs, delivering accurate "cycle counts"
|> (whatever that means) is not a high priority. People who want an
|> accurate timebase use an external hardware reference clock. ....
And, for the reasons I have posted in the past, that doesn't work.
It is trivial to get an external timestamp that is accurate to mS,
but that does not mean that a program can access it to that
accuracy, nor does it deliver uS global consistency, nor nS
resolution.
|> Many things today are done poorly simply because there is a sufficient
|> market for "just good enough" and nobody cares about doing better
|> regardless of whether it might cost the same or even less to implement.
I wouldn't say "nobody", but that is the reason. God alone knows
why they think like that, but they do. Few engineers do.
Regards,
Nick Maclaren. |
|
|
 |
Nick Maclaren
Guest
|
Posted:
Tue Jul 12, 2005 4:15 pm Post subject:
Re: stalling the TSC? |
|
|
In article <daudul$dlh$1@osl016lin.hda.hydro.com>,
Terje Mathisen <terje.mathisen@hda.hydro.com> writes:
|> Nick Maclaren wrote:
|> > In article <p73d5ptpcip.fsf@verdi.suse.de>,
|> > Andi Kleen <freitag@alancoxonachip.com> wrote:
|>
|> >>Look up HPET. It's pretty modern and mostly usable with only minor
|> >>warts. Unfortunately many x86 vendors don't enable it yet (even when
|> >>the chipset has it in theory) because the current versions of Windows
|> >>don't use it and they don't enable anything that's not used by
|> >>Windows. This leaves other timers which have various problems, but can
|> >>be still used with some performance penalty.
|> >
|> > Thanks. I will look at it and see if anyone is learning anything
|> > from experience, even if not history :-)
|>
|> I believe I wrote a post about this, it still has several stupid warts,
|> some of which could have made it much more useful if avoided. :-(
A brief look at it indicates that it has more warts than Oliver
Cromwell, and addresses only a small part of the problem. It may
be better than what was there before, but it is still ghastly.
Consider the following 'minor' issues:
Real-time accuracy, including resynchronisation after coming
out of S1 and S2 (sleep?) states. Like, none.
Maintaining consistency across SMP systems. Like, none.
Integrating it with any 'GHz' timer (i.e. a cycle counter).
Like, none.
500 ppm? And only for periods of over 1 mS? And the wording
is such that it is allowed to be 18,000 ppm out for intervals
of just over 100 uS :-)
I like femptoseconds - clearly a short interval with no content.
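For scale, the ppm figures mentioned above translate into absolute error as follows. This is plain arithmetic on the numbers quoted in this post. The function name is hypothetical and nothing here is taken from the HPET specification itself.

```python
def max_error_ns(interval_ns, ppm):
    """Worst-case absolute error for a clock that is off by `ppm` parts per million."""
    return interval_ns * ppm / 1_000_000


# 500 ppm over a 1 ms interval.
print(max_error_ns(1_000_000, 500))
# 18,000 ppm over a 100 us interval.
print(max_error_ns(100_000, 18_000))
```

The two cases come to 500 ns and 1800 ns respectively, so the shorter interval is allowed the larger absolute error.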
Regards,
Nick Maclaren. |
|
|
 |