| Author |
Message |
Niels Jørgen Kruse
Guest
|
Posted:
Wed Sep 28, 2005 4:15 pm Post subject:
Re: Implementation of pop count instruction |
|
|
Dan Koren <dankoren@yahoo.com> wrote:
| Quote: | ""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
news:1h3lh40.rv487f13k047aN%nospam@ab-katrinedal.dk...
Dan Koren <dankoren@yahoo.com> wrote:
I don't do no magic!!! As others have already
noticed, I am a very dumb person who likes to
borrow algorithms invented by others smarter
than I can ever be.
A table lookup implementation of a 32-bit POPC
function is shown on:
http://www.hackersdelight.org/HDcode/pop.cc
as the pop6() function.
Well, that is fairly trivial, but
how do you make it go in 2 clocks?
Courtesy Sun Studio 10 C compiler at
optimization level O5 plus inlining.
|
Byte at a time table lookup on a 8 byte operand would seem to require 8
load instructions. The USIII can hardly do 4 loads per clock!
Show us the timing loop. There must be opportunities for the compiler to
make things evaporate.
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark |
|
| Back to top |
|
 |
Dan Koren
Guest
|
Posted:
Wed Sep 28, 2005 4:15 pm Post subject:
Re: Implementation of pop count instruction |
|
|
"Laurent Desnogues" <l-desnogues_delme@ti.com> wrote in message
news:dhds6k$jlq$1@home.itg.ti.com...
| Quote: |
Sorry Dan, I did not mean to put your skills in doubt, I
was just trying to get more information about a claim
that contradicts my observation.
|
Since you and I are using popc in different
algorithms to solve different problems using
different code there is really no contradiction
between the behaviors we observed.
dk |
|
| Back to top |
|
 |
Laurent Desnogues
Guest
|
Posted:
Wed Sep 28, 2005 4:15 pm Post subject:
Re: Implementation of pop count instruction |
|
|
Dan Koren wrote:
| Quote: |
If memory serves, the Ultra SPARC III+
can issue up to four instructions per
clock cycle -- can someone confirm or
refute this notion?
|
http://www.sun.com/processors/UltraSPARC-III/index.xml
seems to say so:
You will find the User Manual on that page.
Laurent |
|
| Back to top |
|
 |
Peter \"Firefly\" Lund
Guest
|
Posted:
Wed Sep 28, 2005 4:15 pm Post subject:
Re: Implementation of pop count instruction |
|
|
On Wed, 28 Sep 2005, Terje Mathisen wrote:
| Quote: | Which means that popc should be compiled with enough NOPs around it that
a trap handler can backpatch the opcode into a CALL doing the same job.
|
This has the same problem that a dynamic linker/loader has with
non-position independent code (very common on IA32).
Sharing code pages between processes and using files as backing store
becomes harder. As far as I know, Linux uses Copy-on-Write and gives each
process its own copy of relocated code pages. The patched pages no longer
use the executable file as backing store but will have to go to swap if
the physical memory is more needed by something else. They also won't be
kept around for caching purposes after the program terminates, so the next
program that uses the same library -- even loaded at the same virtual
address -- will have to reload the pages from the executable, patch them,
take a page fault doing that, etc.
I think the situation in the Windows world is similar but I dimly recall
some stuff from John Levine's book on linkers and loaders that suggested
they were doing something slightly smarter.
Of course, prelinking exists in both worlds -- choose a "most desired load
address" for each library and have a tool prepatch the associated values
into the file. As long as the library can be loaded at that address, you
get the sharing and caching stuff.
Have there been any (proprietary?) versions of Unix that handled sharing
in the face of load-time relocation and/or patching up to handle
instruction set variations?
-Peter |
|
| Back to top |
|
 |
Dan Koren
Guest
|
Posted:
Wed Sep 28, 2005 11:04 pm Post subject:
Re: Implementation of pop count instruction |
|
|
""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
news:1h3lq2w.17779m0a6xzp1N%nospam@ab-katrinedal.dk...
| Quote: | Dan Koren <dankoren@yahoo.com> wrote:
""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
news:1h3lh40.rv487f13k047aN%nospam@ab-katrinedal.dk...
Dan Koren <dankoren@yahoo.com> wrote:
I don't do no magic!!! As others have already
noticed, I am a very dumb person who likes to
borrow algorithms invented by others smarter
than I can ever be.
A table lookup implementation of a 32-bit POPC
function is shown on:
http://www.hackersdelight.org/HDcode/pop.cc
as the pop6() function.
Well, that is fairly trivial, but
how do you make it go in 2 clocks?
Courtesy Sun Studio 10 C compiler at
optimization level O5 plus inlining.
Byte at a time table lookup on a 8 byte
operand would seem to require 8 load
instructions. The USIII can hardly do
4 loads per clock!
|
The lookup table in my implementation
uses shorts instead of chars. However
I have timed both and the difference
wasn't dramatic.
In any case, I looked at the generated
assembly code, and it is quite clever.
Why don't you compile it -s and see
for yourself?
| Quote: | Show us the timing loop.
|
Huh ?!?
int n = 1000000000;
hrtime_t t0, t1;
double dt;
t0 = gethrtime ();
while (n--) popc (whatever);
t1 = gethrtime ();
dt = t1 - t0;
I don't doubt you could have written
this yourself ;-)
| Quote: | There must be opportunities for the
compiler to make things evaporate.
|
Only the procedure call since popc is
inlined using #pragma inline or the
equivalent compiler directives.
dk |
|
| Back to top |
|
 |
Dan Koren
Guest
|
Posted:
Wed Sep 28, 2005 11:07 pm Post subject:
Re: Implementation of pop count instruction |
|
|
"Dan Koren" <dankoren@yahoo.com> wrote in message
news:433adb13$1@news.meer.net...
| Quote: | ""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
news:1h3lq2w.17779m0a6xzp1N%nospam@ab-katrinedal.dk...
Dan Koren <dankoren@yahoo.com> wrote:
""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
news:1h3lh40.rv487f13k047aN%nospam@ab-katrinedal.dk...
Dan Koren <dankoren@yahoo.com> wrote:
I don't do no magic!!! As others have already
noticed, I am a very dumb person who likes to
borrow algorithms invented by others smarter
than I can ever be.
A table lookup implementation of a 32-bit POPC
function is shown on:
http://www.hackersdelight.org/HDcode/pop.cc
as the pop6() function.
Well, that is fairly trivial, but
how do you make it go in 2 clocks?
Courtesy Sun Studio 10 C compiler at
optimization level O5 plus inlining.
Byte at a time table lookup on a 8 byte
operand would seem to require 8 load
instructions. The USIII can hardly do
4 loads per clock!
The lookup table in my implementation
uses shorts instead of chars. However
I have timed both and the difference
wasn't dramatic.
In any case, I looked at the generated
assembly code, and it is quite clever.
Why don't you compile it -s and see
for yourself?
|
Just as a hint, I suppose you realize
that a good compiler can find ways to
eliminate loads of constants known at
compile time and replace at least some
of them with instructions that generate
the same constants from immediate values
through shifts, masks and bitwise logical
operations.
dk |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Wed Sep 28, 2005 11:42 pm Post subject:
Re: Implementation of pop count instruction |
|
|
Dan Koren wrote:
| Quote: | Huh ?!?
int n = 1000000000;
hrtime_t t0, t1;
double dt;
t0 = gethrtime ();
while (n--) popc (whatever);
t1 = gethrtime ();
dt = t1 - t0;
I don't doubt you could have written
this yourself ;-)
|
The code is fine, please show the generated asm!
| Quote: |
There must be opportunities for the
compiler to make things evaporate.
Only the procedure call since popc is
inlined using #pragma inline or the
equivalent compiler directives.
|
As soon as the compiler have inlined the table lookups, it is also free
to discover any additional optimizations, like the possibility that
'whatever' is constant across the loop.
I hope/assume this wasn't the case here?
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
Wilco Dijkstra
Guest
|
Posted:
Thu Sep 29, 2005 12:05 am Post subject:
Re: Implementation of pop count instruction |
|
|
"Dan Koren" <dankoren@yahoo.com> wrote in message
news:433adb13$1@news.meer.net...
| Quote: | ""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
news:1h3lq2w.17779m0a6xzp1N%nospam@ab-katrinedal.dk...
Show us the timing loop.
Huh ?!?
int n = 1000000000;
hrtime_t t0, t1;
double dt;
t0 = gethrtime ();
while (n--) popc (whatever);
t1 = gethrtime ();
dt = t1 - t0;
|
So basically you are testing the performance of an empty loop...
1. you need to provide a unique value to popc each iteration
2. you need to reduce the results of popc into a single value
3. you need to do something non-trivial with the final result
If you don't do all of these then any decent compiler will create an
empty loop as a result of inlining, loop invariant code motion, dead
code elimination etc. For example, this will work:
int res = 0;
t0 = gethrtime ();
while (n--) res += popc (n);
t1 = gethrtime ();
printf("result %d\n",res);
Wilco |
|
| Back to top |
|
 |
Christian Bau
Guest
|
Posted:
Thu Sep 29, 2005 12:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
In article <433b12a3@news.meer.net>, "Dan Koren" <dankoren@yahoo.com>
wrote:
| Quote: | "Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:dheo6l$mmu$1@osl016lin.hda.hydro.com...
Dan Koren wrote:
Huh ?!?
int n = 1000000000;
hrtime_t t0, t1;
double dt;
t0 = gethrtime ();
while (n--) popc (whatever);
t1 = gethrtime ();
dt = t1 - t0;
I don't doubt you could have written
this yourself ;-)
The code is fine, please show the
generated asm!
?!?!?
I am no compiler, and I do not have
access to Sun Studio right now being
on vacation. I'm sure you don't need
my help for such an experimen, or do
you?
|
To be honest, if you claim that this machine can do a population count
of a 64 bit register in two cycles using a lookuptable, I would say "no
f***ing way" and "prove it!". |
|
| Back to top |
|
 |
Dan Koren
Guest
|
Posted:
Thu Sep 29, 2005 12:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
"Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> wrote in message
news:hQB_e.2706$O%.441@newsfe1-gui.ntli.net...
| Quote: |
"Dan Koren" <dankoren@yahoo.com> wrote in message
news:433adb13$1@news.meer.net...
""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
news:1h3lq2w.17779m0a6xzp1N%nospam@ab-katrinedal.dk...
Show us the timing loop.
Huh ?!?
int n = 1000000000;
hrtime_t t0, t1;
double dt;
t0 = gethrtime ();
while (n--) popc (whatever);
t1 = gethrtime ();
dt = t1 - t0;
So basically you are testing the performance of an empty loop..
|
No, because this is just a code skeleton, not a complete test.
| Quote: | 1. you need to provide a unique value to popc each iteration
2. you need to reduce the results of popc into a single value
3. you need to do something non-trivial with the final result
|
Thanks for the free tutorial.
The popc routine is used in a super heavy duty, super fast and
super scalable memory allocator, and it has been extensively
profiled. You don't expect me to publish the whole thing here.
Or do you? One thing is sure -- I'd get fired if I did.
| Quote: | If you don't do all of these then any decent compiler will create an empty
loop as a result of inlining, loop invariant code motion, dead code
elimination etc. For example, this will work:
int res = 0;
t0 = gethrtime ();
while (n--) res += popc (n);
t1 = gethrtime ();
printf("result %d\n",res);
|
Are you just fresh out of school and feel the urge to show off?
You seem to assume people don't know anything besides the most
naive and literally limited interpretation of each sentence.
dk
dk |
|
| Back to top |
|
 |
Dan Koren
Guest
|
Posted:
Thu Sep 29, 2005 12:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
"Terje Mathisen" <terje.mathisen@hda.hydro.com> wrote in message
news:dheo6l$mmu$1@osl016lin.hda.hydro.com...
| Quote: | Dan Koren wrote:
Huh ?!?
int n = 1000000000;
hrtime_t t0, t1;
double dt;
t0 = gethrtime ();
while (n--) popc (whatever);
t1 = gethrtime ();
dt = t1 - t0;
I don't doubt you could have written
this yourself ;-)
The code is fine, please show the
generated asm!
|
?!?!?
I am no compiler, and I do not have
access to Sun Studio right now being
on vacation. I'm sure you don't need
my help for such an experimen, or do
you?
| Quote: | There must be opportunities for the
compiler to make things evaporate.
Only the procedure call since popc is
inlined using #pragma inline or the
equivalent compiler directives.
As soon as the compiler have inlined the
table lookups, it is also free to discover
any additional optimizations, like the
possibility that 'whatever' is constant
across the loop.
I hope/assume this wasn't the case here?
|
Of course not.
You sound obsessively pedantic if I may
say so. If you're willing to take at
face value my statement that the code
has been extensively measured and tested
in a professional manner, fine. If not,
I'm not going to bother to do an online
design and/or code review. This is an
informal discussion group. Lighten up.
dk |
|
| Back to top |
|
 |
James Van Buskirk
Guest
|
Posted:
Thu Sep 29, 2005 12:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
"Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> wrote in message
news:hQB_e.2706$O%.441@newsfe1-gui.ntli.net...
| Quote: | int res = 0;
t0 = gethrtime ();
while (n--) res += popc (n);
t1 = gethrtime ();
printf("result %d\n",res);
|
Nope. Doesn't bust POPCNT function using monster LUT because
it doesn't probe problem space (pseudo-)randomly.
--
write(*,*) transfer((/17.392111325966148d0,6.5794487871554595D-85, &
6.0134700243160014d-154/),(/'x'/)); end |
|
| Back to top |
|
 |
Niels Jørgen Kruse
Guest
|
Posted:
Thu Sep 29, 2005 12:15 am Post subject:
Re: Implementation of pop count instruction |
|
|
Dan Koren <dankoren@yahoo.com> wrote:
| Quote: | ""Niels Jørgen Kruse"" <nospam@ab-katrinedal.dk> wrote in message
news:1h3lq2w.17779m0a6xzp1N%nospam@ab-katrinedal.dk...
Byte at a time table lookup on a 8 byte
operand would seem to require 8 load
instructions. The USIII can hardly do
4 loads per clock!
|
I checked and the USIII has only one L/S pipe.
| Quote: | The lookup table in my implementation
uses shorts instead of chars. However
I have timed both and the difference
wasn't dramatic.
In any case, I looked at the generated
assembly code, and it is quite clever.
Why don't you compile it -s and see
for yourself?
|
Gcc4.0 doesn't do anything clever. Didn't even avoid the signextend on
the loads.
I don't have a Sun. What did the compiler do?
--
Mvh./Regards, Niels Jørgen Kruse, Vanløse, Denmark |
|
| Back to top |
|
 |
John Savard
Guest
|
Posted:
Thu Sep 29, 2005 7:18 am Post subject:
Re: Implementation of pop count instruction |
|
|
On Mon, 26 Sep 2005 02:31:48 GMT, "Stephen Fuld"
<s.fuld@PleaseRemove.att.net> wrote, in part:
| Quote: | One could have a tree of adders
|
An illustration of how this could be done in hardware is given on my
site at
http://www.quadibloc.com/arch/ar0301.htm
| Quote: | So, for real CPUs that have this instruction, how is it implemented?
|
The Control Data 6600 used its division hardware.
John Savard
http://home.ecn.ab.ca/~jsavard/index.html
http://www.quadibloc.com/index.html
_________________________________________
Usenet Zone Free Binaries Usenet Server
More than 140,000 groups
Unlimited download
http://www.usenetzone.com to open account |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Thu Sep 29, 2005 7:45 am Post subject:
Re: Implementation of pop count instruction |
|
|
Can anyone tell me for what this popcnt-instruction should be good
for? I'm able to imagine some useful cases for such an instruction
but I can't imagine where such an instruction would give a perfor-
mance-boost of a complete algorithm. |
|
| Back to top |
|
 |
|
|
|
|