| Author |
Message |
Rainer Weikusat
Guest
|
Posted:
Thu Oct 20, 2005 4:15 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
"Peter Dimov" <pdimov@gmail.com> writes:
| Quote: | Wilco Dijkstra wrote:
String
concatenation using STL is around 30-50 times slower than the equivalent
C for example - mostly due to the overhead of new/delete.
This is off-topic in all three groups, but I just can't let it slide.
If by "equivalent C" you mean in-place strcat into a fixed size buffer,
then no, the equivalent std::string operation (s1 += s2, where s1 has
enough capacity to hold the result) does neither new nor delete and is
nowhere near 30-50 times slower.
|
The assumption that unknown pieces of information can be fabricated
freely to support an arbitrary point is usually wrong. |
|
| Back to top |
|
 |
dick.dunbar
Guest
|
Posted:
Fri Oct 21, 2005 12:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
I also have an allocator that I've never published, and runs circles
around every other implementation I've raced, including hoard. It has
no fragmentation issues and has always had a smaller overall memory
footprint.
There's a few more improvements I can wring out of the code for
threaded environments.
The code is so simple, that I do think people could come up with the
same idea.
It's also so small, that it could be reverse-engineered by anybody in
an hour. |
|
| Back to top |
|
 |
Wilco Dijkstra
Guest
|
Posted:
Fri Oct 21, 2005 12:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
"Peter Dimov" <pdimov@gmail.com> wrote in message
news:1129818056.025495.215360@g47g2000cwa.googlegroups.com...
| Quote: | Wilco Dijkstra wrote:
String
concatenation using STL is around 30-50 times slower than the equivalent
C for example - mostly due to the overhead of new/delete.
This is off-topic in all three groups, but I just can't let it slide.
If by "equivalent C" you mean in-place strcat into a fixed size buffer,
then no, the equivalent std::string operation (s1 += s2, where s1 has
enough capacity to hold the result) does neither new nor delete and is
nowhere near 30-50 times slower.
|
If you're interested I can try finding the original benchmark - it was doing
something like:
string s = s1 + " " + s2 + " " + s3 + " " + s4;
It's the most obvious way of writing string concatenation, but in C++
it's as inefficient as you can get in terms of needless temporaries being
generated. Unfortunately due to C++ being a complex language, few
people have a detailed understanding of when temporaries are being
generated, and to make matters worse most compilers use more
temporaries than necessary...
Wilco |
|
| Back to top |
|
 |
Chris Thomasson
Guest
|
Posted:
Fri Oct 21, 2005 12:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | OK, in that case I'll bet you a beer (even though I don't drink the
stuff myself) that at least one, more probably several, of the regulars
here have already written something quite similar, and done it decades
ago.
Try reading the Wheeler posts!
|
;) |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Fri Oct 21, 2005 3:16 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | It's also so small, that it could be reverse-engineered by anybody in
an hour.
|
That's not true for my allocator. The whole code is somewhat above 1.300
lines with a large load of debug code. Nevertheless, it's amazingly fast
because in most cases the code runs through a very short path of the allo-
cation and deallocation-routine (which is inline) and the compiler is free
to strip out some code ouf of this f.e. when it detects that. |
|
| Back to top |
|
 |
Jens Meyer
Guest
|
Posted:
Fri Oct 21, 2005 5:18 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | The assumption that unknown pieces of information can be fabricated
freely to support an arbitrary point is usually wrong.
|
AAAAAAAAAAAAAAAH!
To international readers not familiar with this totally mentally disordered
guy: Rainer is a word-pettiflogger. He always finds some inprecise forumula-
tions and show that when you read the words literally and put them out of the
context, you're _totally_wrong_. And sometimes he states that it's extremely
dangerous to be inprecise in wording - even when no one of the other readers
had a problem to understand the words correctly.
He does this to feel free from such things himself and to consider himself
to be perfect. That's because in his subconciousness, he feels the opposite
of perfect.
I'm gonna repost this to other threads whenever Rainer will say something to
prevent that someone takes the effort of any senseless conversations with him. |
|
| Back to top |
|
 |
David Hopwood
Guest
|
Posted:
Fri Oct 21, 2005 5:19 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Eric Northup wrote:
| Quote: | Aligning up to a power-of-2 and then putting all the padding at the end
can cause pathological cache behavior on some architectures. See
section 4.1 ("Impact of Buffer Address Distribution on Cache
Utilization") of Jeff Bonwick's 94 USENIX paper "The Slab Allocator: An
Object-Caching Kernel Memory Allocator".
|
This is the result of the hash function that maps an address to a cache
line being too simple, depending only on some bits of the address. But
it is quite trivial to make the hash function depend on all bits of the
address, at very small hardware cost. So why is it not considered the
responsibility of hardware designers to get this right, rather than
the responsibility of software to work around poorly designed caches?
--
David Hopwood <david.nospam.hopwood@blueyonder.co.uk> |
|
| Back to top |
|
 |
Jens Meyer
Guest
|
Posted:
Fri Oct 21, 2005 5:20 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | If you're interested I can try finding the original benchmark - it was doing
something like:
string s = s1 + " " + s2 + " " + s3 + " " + s4;
It's the most obvious way of writing string concatenation, but in C++
it's as inefficient as you can get in terms of needless temporaries being
generated. Unfortunately due to C++ being a complex language, few
people have a detailed understanding of when temporaries are being
generated, and to make matters worse most compilers use more
temporaries than necessary...
|
You take the whole thing to general.
For most applications performance isn't a concern; so the above doesn't hurt. |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Fri Oct 21, 2005 5:45 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | Even up to a page you can get severe fragmentation for allocations
that are just larger than a power of 2.
|
Of course; but that's all for performance. The only relevant internal
fragmentation is the tail-padding of the block in terms of cacheline
-size.
| Quote: | Note the assumption that block sizes have a flat distribution is flawed,
|
I didn't assume that.
| Quote: | 25% fragmentation means 33% more memory, ...
|
So what! Performance is often bought with memory-consumption. And I'm
not really sure that the fragmentation of my allocator is that worse.
With usual allocators there might be a lot of blocks which aren't usable
because they don't meet any allocation-requests.
| Quote: | so you should see an increase in TLB and page misses.
|
I'm allocating the pages for the small blocks from a large pool of pages
in a *single* address-range to get maximum performance when looking up the
right slab-descriptor when doing a deallocation. To prevent any synchroni-
zation-overhead, this block can't be moved and therey can't be resized
once it has been initially allocated. So the OS is free to use large pages
for this address-range. I've written a little test which looks if this is
done by my OS I'm currently developing with (WinXP), and it is:
http://www.mallocator.org/TLBPerf.cpp
But even if the OS doesn't employ large pages, I doubt that the limited
number of tlb-entries would be a problem with my allocator.
| Quote: | Using more memory than you need is always a cost, ...
|
We're talking about small blocks up to a page here! That's not the type
of cost with isn't worth to be mentioned on the applications my allocator
is targeted for.
| Quote: | ... so it is worth spending a little extra time to reduce fragmentation.
|
Why are you so sure that the internal fragementation I have (which is only
noteworthy in terms of cacheline tail-blend) is worse than the external frag-
mentation of other allocators? Have you seen any invesigations on this?
| Quote: | CPUs still don't have that many TLB entries, I recall the initial Athlon64
increased it to 40 for its 64KByte L1 cache - so just 160KBytes of
memory can be addressed without incurring a TLB miss...
|
You consider this to be a problem; I don't believe that this is a problem
in most apps.
| Quote: | TLB misses are often underestimated, you can easily get many TLB
misses with so few entries despite hitting the L1 cache every time.
|
On what? Synthetic benchmarks?
| Quote: | Most applications aren't performance-crucial; so you can't make
a generalized no-no from this insight on std::string-performance.
I don't agree - it's precisely this kind of attitude
that has resulted in all the sluggish bloatware.
|
Yes, when you broaden this attitude too much, an app may be too slow.
But I'll bet that at least 95% of all allocations in all apps ever
written aren't performance-crucial.
| Quote: | You can't tell when something is performance critical or not ...
|
In most cases you can do this immediately without any deeper thoughts
on what you're writing. |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Fri Oct 21, 2005 5:46 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | Aligning up to a power-of-2 and then putting all the padding at the end
can cause pathological cache behavior on some architectures. See section
4.1 ("Impact of Buffer Address Distribution on Cache Utilization") of Jeff
Bonwick's 94 USENIX paper "The Slab Allocator: An Object-Caching Kernel
Memory Allocator".
|
Yes, that's possible; but I'll bet that this has proven only in micro-bench-
marks. *g* |
|
| Back to top |
|
 |
Maciej Sobczak
Guest
|
Posted:
Fri Oct 21, 2005 8:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Wilco Dijkstra wrote:
| Quote: | If you're interested I can try finding the original benchmark - it was doing
something like:
string s = s1 + " " + s2 + " " + s3 + " " + s4;
Unfortunately due to C++ being a complex language, few
people have a detailed understanding of when temporaries are being
generated
|
I think you are overestimating it. If you think that the above code
might be a performance bottleneck in your program and the only thing you
know about temporaries is that they are evil, then do this:
string s;
s.reserve(s1.size() + s2.size() + s3.size() + s4.size() + 3);
s += s1;
s += " ";
s += s2;
s += " ";
s += s3;
s += " ";
s += s4;
Granted - more work than originally, but if you compare it to
"equivalent C" then consider "equivalent C *code*" as well.
(I know - it's already off topic)
--
Maciej Sobczak : http://www.msobczak.com/
Programming : http://www.msobczak.com/prog/ |
|
| Back to top |
|
 |
Rainer Weikusat
Guest
|
Posted:
Fri Oct 21, 2005 3:30 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Jens Meyer <Jens.Meyer@quantentunnel.org> writes:
| Quote: | The assumption that unknown pieces of information can be fabricated
freely to support an arbitrary point is usually wrong.
AAAAAAAAAAAAAAAH!
To international readers not familiar with this totally mentally disordered
guy:
|
At least, I am sane enough to not use two different pseudonyms in the
same thread.
| Quote: | Rainer is a word-pettiflogger. He always finds some inprecise forumula-
tions and show that when you read the words literally and put them out of the
context, you're _totally_wrong_.
|
In the given context, nobody except the poster I answered to had been
talking about using 'strcat on a fixed size buffer of a known-to-be
sufficient size. This is therefore a 'classical' strawman. |
|
| Back to top |
|
 |
Andy Glew
Guest
|
Posted:
Fri Oct 21, 2005 4:15 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | Eric Northup wrote:
Aligning up to a power-of-2 and then putting all the padding at the end
can cause pathological cache behavior on some architectures.
David Hopwood <david.nospam.hopwood@blueyonder.co.uk> writes:
This is the result of the hash function that maps an address to a cache
line being too simple, depending only on some bits of the address. But
it is quite trivial to make the hash function depend on all bits of the
address, at very small hardware cost. So why is it not considered the
responsibility of hardware designers to get this right, rather than
the responsibility of software to work around poorly designed caches?
|
I tend to agree with you, except that I am reasonably certain that
incorporating many more bits into the hash function for the closest in
or L1 cache would slow the L1 cache down dramatically - e.g. froom 3
cycles to 4, or from 4 to 5.
* yes, just XORing in more address bits *will* cost that much more time
* not to mention the fact that, in a physically indexed cache, the extra
physical address bits are not available at the time the cache is indexed - since
the extra phuysical address bits come from the TLB, and are only available in time
to do the tag match (in fact, on Wmt they were only available several cycles
after writeback).
I.e. using a more complete hash function to generate the cache index
may well require switching to a virtually indexed cache. Which raises
issues of its own. And, even there, wiill be slower.
I think that it is more reasonable to consider more complete index
functions for the L2 cache accesses and further out. |
|
| Back to top |
|
 |
TJ Merritt
Guest
|
Posted:
Fri Oct 21, 2005 10:03 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Oliver S. wrote:
| Quote: | Note the assumption that block sizes have a flat distribution is flawed,
I didn't assume that.
|
Do some block sizes have less fragmentation than others?
| Quote: | 25% fragmentation means 33% more memory, ...
So what! Performance is often bought with memory-consumption. And I'm
not really sure that the fragmentation of my allocator is that worse.
With usual allocators there might be a lot of blocks which aren't usable
because they don't meet any allocation-requests.
|
Large amounts for fragmentation means, that less data fits in the L1
cache, less data fits in the L2 cache, less data fits in system memory,
lets data fits in the swap file. The fragmentation is costly in real
performance because the bandwidth between these memories is consumed by
elements that are not used. You can't have a top performing memory
allocator and have 33% of memory go unused due to fragmentation.
| Quote: | so you should see an increase in TLB and page misses.
I'm allocating the pages for the small blocks from a large pool of pages
in a *single* address-range to get maximum performance when looking up the
right slab-descriptor when doing a deallocation. To prevent any synchroni-
zation-overhead, this block can't be moved and therey can't be resized
once it has been initially allocated. So the OS is free to use large pages
for this address-range. I've written a little test which looks if this is
done by my OS I'm currently developing with (WinXP), and it is:
http://www.mallocator.org/TLBPerf.cpp
But even if the OS doesn't employ large pages, I doubt that the limited
number of tlb-entries would be a problem with my allocator.
|
One of the benchmarks the I use to test the dozen or more memory
allocators that I have written is
to allocate N blocks of M bytes. Link these blocks together using a
pointer at offset P within the
block. Start the timer and then chases serveral million pointers.
Depending upon how the allocator layed out the blocks the execution
time will vary considerably. Keep in mind we're not measuring the
allocation time, just the performance effects of the allocation
choices. Make sure that you run the benchmark with a wide variety of
choices for N and M. Say N*M < L1 size, N*M < L2 size, N*M == L2 size.
N*M == Memory size, N*M >>> Memory size.
| Quote: | Using more memory than you need is always a cost, ...
We're talking about small blocks up to a page here! That's not the type
of cost with isn't worth to be mentioned on the applications my allocator
is targeted for.
|
An allocation method that is great for one application doesn't make for
an allocation method that is even acceptable for a wide variety of
applications.
| Quote: | Why are you so sure that the internal fragementation I have (which is only
noteworthy in terms of cacheline tail-blend) is worse than the external frag-
mentation of other allocators? Have you seen any invesigations on this?
|
My best performing allocator has neither significant internal or
external fragmentation.
Previous allocator performance testing has shown that fragmentation is
a very significant performance issue and reducing fragmentation
increases application performance, especially when bumping up against
memory size boundaries.
| Quote: | CPUs still don't have that many TLB entries, I recall the initial Athlon64
increased it to 40 for its 64KByte L1 cache - so just 160KBytes of
memory can be addressed without incurring a TLB miss...
You consider this to be a problem; I don't believe that this is a problem
in most apps.
|
If the allocator lays out N blocks using N different pages and there
are only N-1 TLB entries and you walk the list of blocks twice, you'll
definitely feel the the effect. Run the benchmark described above and
you'll see this effect depending upon the allocator you test with.
| Quote: | TLB misses are often underestimated, you can easily get many TLB
misses with so few entries despite hitting the L1 cache every time.
On what? Synthetic benchmarks?
|
Real applications that do any reasonably amount of pointer chasing will
feel the effect.
| Quote: | Most applications aren't performance-crucial; so you can't make
a generalized no-no from this insight on std::string-performance.
I don't agree - it's precisely this kind of attitude
that has resulted in all the sluggish bloatware.
Yes, when you broaden this attitude too much, an app may be too slow.
But I'll bet that at least 95% of all allocations in all apps ever
written aren't performance-crucial.
|
The allocation is probably not performance critical, but the block
placement maybe. |
|
| Back to top |
|
 |
Casper H.S. Dik
Guest
|
Posted:
Fri Oct 21, 2005 10:14 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
"TJ Merritt" <googlegroups@codegen.com> writes:
| Quote: | Large amounts for fragmentation means, that less data fits in the L1
cache, less data fits in the L2 cache, less data fits in system memory,
lets data fits in the swap file. The fragmentation is costly in real
performance because the bandwidth between these memories is consumed by
elements that are not used. You can't have a top performing memory
allocator and have 33% of memory go unused due to fragmentation.
|
Not to mention the application which used to run with 20% to spare which
will now die a horrible paging death.
| Quote: | An allocation method that is great for one application doesn't make for
an allocation method that is even acceptable for a wide variety of
applications.
|
That's probably why we have seven (7) or so memory allocators in Solaris,
including two which scale.
Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth. |
|
| Back to top |
|
 |
|
|
|
|