| Author |
Message |
Oliver S.
Guest
|
Posted:
Wed Oct 19, 2005 8:15 am Post subject:
Micro-Benchmark for Memory-Allocators |
|
|
I've developed a extremly performant general-purpose memory allocator (it runs
currently under windows, but should be adapted for Unices in an hour). I'm very
confident it will outperform HOARD and many others both in singlethreaded as well
as in multithreaded scenarios. But the performance comes at a price of 25% average
internal fragmentation of every memory-block and some "fragementation" due to pri-
vate ranges threads reserve for themselfes. But the architecture allows this allo-
cator to scale nearly linearely with the number of processors.
I'm currently not sure whether I should make this open-source without any limita-
tions, patent the algorithms and leave it open-source and non-commercial for non
-profit projects (I've got a date with a patent-attorney on thursday) or get it a
completely commercial close-source component without any patented algorithms.
But I'm searchig for some common micro-benchmarks to get excact numbers on how
my allocator performs against others like HOARD. Does anyone know if there are
common benchmarks for this?
Thanks in advance! |
|
| Back to top |
|
 |
Casper H.S. Dik
Guest
|
Posted:
Wed Oct 19, 2005 10:47 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
"Oliver S." <Follow.Me@gmx.net> writes:
| Quote: | I've developed a extremly performant general-purpose memory allocator (it runs
currently under windows, but should be adapted for Unices in an hour). I'm very
confident it will outperform HOARD and many others both in singlethreaded as well
as in multithreaded scenarios. But the performance comes at a price of 25% average
internal fragmentation of every memory-block and some "fragementation" due to pri-
vate ranges threads reserve for themselfes. But the architecture allows this allo-
cator to scale nearly linearely with the number of processors.
I'm currently not sure whether I should make this open-source without any limita-
tions, patent the algorithms and leave it open-source and non-commercial for non
-profit projects (I've got a date with a patent-attorney on thursday) or get it a
completely commercial close-source component without any patented algorithms.
But I'm searchig for some common micro-benchmarks to get excact numbers on how
my allocator performs against others like HOARD. Does anyone know if there are
common benchmarks for this?
|
Hoard comes with a number of tests you could run but you should also
look at a number of other designs. Hoard certainly isn't state-of-the-art.
Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth. |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Wed Oct 19, 2005 11:42 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | Hoard comes with a number of tests you could run but you should also
look at a number of other designs. Hoard certainly isn't state-of-the-art.
|
Really? I thought that hord is very competitive. But this may be becuause
of the constraints imposed on the implementation, i.e. the fragmentation
and so on. My allocator isn't that constrained and f.e. acceppts anverage
internal fragmentation of 25% on every allocated memory-block. But the
results are really amazing; in the tests I did, my allocator was between
5,5 and sixteen times faster than that of the Visual-C++ runtime library. |
|
| Back to top |
|
 |
Wilco Dijkstra
Guest
|
Posted:
Thu Oct 20, 2005 12:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
"Oliver S." <Follow.Me@gmx.net> wrote in message
news:dj640m$k95$1@domitilla.aioe.org...
| Quote: | Hoard comes with a number of tests you could run but you should also
look at a number of other designs. Hoard certainly isn't
state-of-the-art.
Really? I thought that hord is very competitive. But this may be becuause
of the constraints imposed on the implementation, i.e. the fragmentation
and so on. My allocator isn't that constrained and f.e. acceppts anverage
internal fragmentation of 25% on every allocated memory-block. But the
results are really amazing; in the tests I did, my allocator was between
5,5 and sixteen times faster than that of the Visual-C++ runtime library.
|
25% fragmentation sounds like rounding up block sizes to a power of 2 -
not a good idea. Allowing a lot of fragmentation is bad as performance
drops due to less locality and more TLB/page misses. This effect can be
significant in my experience, and you only notice it when you're using the
allocated memory for real - typically not in a micro benchmark.
Comparing against the VC++ allocator is not that interesting - it is one of
the slowest allocators out there, even the single threaded version. String
concatenation using STL is around 30-50 times slower than the equivalent
C for example - mostly due to the overhead of new/delete. It's why STL
and new are big no-no's in my book, and region based allocators are my
favorite as they provide the fatest possible allocation for small block
sizes.
To answer your question on benchmarking, I would try to run it on real
applications as well. Micro benchmarks are just that, and may overstate
the benefit (for example complex tree data structures will be cached in
a benchmark but less likely in a real application).
Finally, patenting is only worth the money if you're sure nobody has done
something similar already and you're planning to use your allocator in
a commercial product.
Wilco |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Thu Oct 20, 2005 5:25 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | 25% fragmentation sounds like rounding up block sizes to a power
of 2 - not a good idea.
|
That's excactly what I'm doing. I think that's quite ok on desktop- and
even more for server-systems for performance-critical applications. Other
apps can stich with less aggressivee allocators. But this round-up occurs
"only" on blocks up to one page and the small-block pool can only grow up
to a predefined limit whic is 64MB by default.
| Quote: | Allowing a lot of fragmentation is bad as performance drops due to less
locality and more TLB/page misses.
|
The raise on TLB misses shouldn't be woth to be mentioned with an average
internal block-fragmentation of 25%.And my allocator has some optimizations
which give a thread memory-blocks which have been freed recently to impove
cache-recycling.
| Quote: | This effect can be significant in my experience, ...
|
I'll bet my right hand that the raise in TLB misses isn't noteworthy in
terms of performance-reduction. And the tail-padding of each-block that
causes an internal fragmentation in terms of unused cacheline-storage of
the cache is also very small.
| Quote: | String concatenation using STL is around 30-50 times slower than the
equivalent C for example - mostly due to the overhead of new/delete.
|
That's one thing I often was annoyed about. I'd rather see a template
string-clas with a template-parameter that specifies a size for an inline
buffer of the string so that all operations up to this size go to the in-
line-buffer. That would be OK at least for string-objects allocated on
the stack.
| Quote: | It's why STL and new are big no-no's in my book, ...
|
Most applications aren't performance-crucial; so you can't make
a generalized no-no from this insight on std::string-performance.
| Quote: | and region based allocators are my favorite as they provide
the fatest possible allocation for small block sizes.
|
Of course there are faster *special-purpose* allocators.
| Quote: | To answer your question on benchmarking, I would try to run it on real
applications as well. Micro benchmarks are just that, and may overstate
the benefit (for example complex tree data structures will be cached in
a benchmark but less likely in a real application).
|
Of course; but to compare the allocators among their siblings its ok. |
|
| Back to top |
|
 |
Jason Ozolins
Guest
|
Posted:
Thu Oct 20, 2005 6:24 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Wilco Dijkstra wrote:
| Quote: | Finally, patenting is only worth the money if you're sure nobody has done
something similar already and you're planning to use your allocator in
a commercial product.
|
Heavens no. Why write some software yourself when you can wait for someone
else to write it, and then threaten to sue them for patent infringement? As
long as you pick companies to shake down that are profitable but not big
enough to contest a lawsuit, you'll make out like bandits.
Intellectual property speculation: it's just as valuable to society as land
speculation!
-Jason =:^( |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Thu Oct 20, 2005 7:02 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | Heavens no. Why write some software yourself when you can wait for
someone else to write it, and then threaten to sue them for patent
infringement? As long as you pick companies to shake down that are
profitable but not big enough to contest a lawsuit, you'll make out
like bandits.
|
I think that the architecture of my allocator is that special, that
it's extremely unlikely that will someone will come up with the same
idea. |
|
| Back to top |
|
 |
Greg Lindahl
Guest
|
Posted:
Thu Oct 20, 2005 8:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
In article <dj4dnq$95g$1@domitilla.aioe.org>,
Oliver S. <Follow.Me@gmx.net> wrote:
| Quote: | I've developed a extremly performant general-purpose memory allocator
|
I'd recommend using real programs as a test, instead of
micro-benchmarks. Yes, it's more work, but yes, it's more likely to
tell you what's really going on -- you can easily be fooled by a
micro-benchmark.
-- greg |
|
| Back to top |
|
 |
Eric Northup
Guest
|
Posted:
Thu Oct 20, 2005 8:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Oliver S. wrote:
| Quote: | 25% fragmentation sounds like rounding up block sizes to a power
of 2 - not a good idea.
That's excactly what I'm doing.
[...]
Allowing a lot of fragmentation is bad as performance drops due to less
locality and more TLB/page misses.
[...]
This effect can be significant in my experience, ...
I'll bet my right hand that the raise in TLB misses isn't noteworthy in
terms of performance-reduction. And the tail-padding of each-block that
causes an internal fragmentation in terms of unused cacheline-storage of
the cache is also very small.
|
Aligning up to a power-of-2 and then putting all the padding at the end
can cause pathological cache behavior on some architectures. See
section 4.1 ("Impact of Buffer Address Distribution on Cache
Utilization") of Jeff Bonwick's 94 USENIX paper "The Slab Allocator: An
Object-Caching Kernel Memory Allocator".
--Eric |
|
| Back to top |
|
 |
Chris Dodd
Guest
|
Posted:
Thu Oct 20, 2005 8:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | "Oliver S." <Follow.Me@gmx.net> wrote
But the
results are really amazing; in the tests I did, my allocator was between
5,5 and sixteen times faster than that of the Visual-C++ runtime library.
|
"Wilco Dijkstra" <Wilco_dot_Dijkstra@ntlworld.com> replied
| Quote: | Comparing against the VC++ allocator is not that interesting - it is one of
the slowest allocators out there, even the single threaded version.
|
A good 'lowest common denominator' allocator to compare against (at least
for single threaded performance) is Doug Lea's allocator. A google
search for 'Lea allocator' will turn up it and various academic papers
that talk about other allocation strategies, since Lea's allocator is
often used as a baseline.
Chris Dodd
cdodd@acm.org |
|
| Back to top |
|
 |
David Schwartz
Guest
|
Posted:
Thu Oct 20, 2005 8:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
"Eric Northup" <digitaleric@gmail.com> wrote in message
news:1129782239.992451.95820@g43g2000cwa.googlegroups.com...
| Quote: | Aligning up to a power-of-2 and then putting all the padding at the end
can cause pathological cache behavior on some architectures. See
section 4.1 ("Impact of Buffer Address Distribution on Cache
Utilization") of Jeff Bonwick's 94 USENIX paper "The Slab Allocator: An
Object-Caching Kernel Memory Allocator".
|
There are some incredibly simple ways to fix this, but first you must
realize that it can be a problem.
DS |
|
| Back to top |
|
 |
Guest
|
Posted:
Thu Oct 20, 2005 8:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Oliver S. wrote:
| Quote: | Heavens no. Why write some software yourself when you can wait for
someone else to write it, and then threaten to sue them for patent
infringement? As long as you pick companies to shake down that are
profitable but not big enough to contest a lawsuit, you'll make out
like bandits.
I think that the architecture of my allocator is that special, that
it's extremely unlikely that will someone will come up with the same
idea.
|
Someone once wrote a math PhD on the probability that someone else was
writing the same PhD. In general, the probability is high... |
|
| Back to top |
|
 |
Terje Mathisen
Guest
|
Posted:
Thu Oct 20, 2005 8:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Oliver S. wrote:
| Quote: | Heavens no. Why write some software yourself when you can wait for
someone else to write it, and then threaten to sue them for patent
infringement? As long as you pick companies to shake down that are
profitable but not big enough to contest a lawsuit, you'll make out
like bandits.
I think that the architecture of my allocator is that special, that
it's extremely unlikely that will someone will come up with the same
idea.
|
OK, in that case I'll bet you a beer (even though I don't drink the
stuff myself) that at least one, more probably several, of the regulars
here have already written something quite similar, and done it decades ago.
Try reading the Wheeler posts!
Yes, the micro details of your approach might even be unique, but even
that is in doubt, imho.
Terje
--
- <Terje.Mathisen@hda.hydro.com>
"almost all programming can be viewed as an exercise in caching" |
|
| Back to top |
|
 |
Wilco Dijkstra
Guest
|
Posted:
Thu Oct 20, 2005 4:15 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
"Oliver S." <Follow.Me@gmx.net> wrote in message
news:dj6o4m$drn$1@domitilla.aioe.org...
| Quote: | 25% fragmentation sounds like rounding up block sizes to a power
of 2 - not a good idea.
That's excactly what I'm doing. I think that's quite ok on desktop- and
even more for server-systems for performance-critical applications. Other
apps can stich with less aggressivee allocators. But this round-up occurs
"only" on blocks up to one page and the small-block pool can only grow up
to a predefined limit whic is 64MB by default.
|
Even up to a page you can get severe fragmentation for allocations that are
just larger than a power of 2. Note the assumption that block sizes have a
flat distribution is flawed, applications typically use only a few fixed
sizes
which could mean you're often seeing the worst possible case. And if you
use the buddy system then you also have a lot of external fragmentation.
| Quote: | Allowing a lot of fragmentation is bad as performance drops due to less
locality and more TLB/page misses.
The raise on TLB misses shouldn't be woth to be mentioned with an average
internal block-fragmentation of 25%.
|
25% fragmentation means 33% more memory, so you should see an
increase in TLB and page misses. Using more memory than you need
is always a cost, so it is worth spending a little extra time to reduce
fragmentation. This is why just using micro benchmarks is a bad idea:
they reward allocators that are efficient in CPU time but inefficient in
memory occupation. Real applications benefit more from allocators
that increase locality and reduce page and TLB misses.
| Quote: | This effect can be significant in my experience, ...
I'll bet my right hand that the raise in TLB misses isn't noteworthy in
terms of performance-reduction. And the tail-padding of each-block that
causes an internal fragmentation in terms of unused cacheline-storage of
the cache is also very small.
|
CPUs still don't have that many TLB entries, I recall the initial Athlon64
increased it to 40 for its 64KByte L1 cache - so just 160KBytes of
memory can be addressed without incurring a TLB miss... TLB misses
are often underestimated, you can easily get many TLB misses with
so few entries despite hitting the L1 cache every time.
| Quote: | It's why STL and new are big no-no's in my book, ...
Most applications aren't performance-crucial; so you can't make
a generalized no-no from this insight on std::string-performance.
|
I don't agree - it's precisely this kind of attitude that has resulted in
all the sluggish bloatware. You can't tell when something is performance
critical or not - basically anything can become a bottleneck if it is slowed
down by 2-3 orders of magnitude (I prefer seeing the characters I'm typing
immediately etc).
As a real-life example, a compiler I worked on suddenly started to have
noticeable startup delays and build time increased by more than 50%.
After some investigation it turned out that a graduate rewrote the
command-line parser using brain-dead STL causing it to be 250 times
slower... He then spent another few weeks trying to optimise the code
and got it close to 100 times as slow :-)
I had a look and proved it couldn't ever be better than 50 times as
slow due to STL. When I told him to rewrite it in C, we ended up with
a 10 times speedup over the original simplistic parser.
It's difficult to imagine how STL could be useful given the enormous
overheads. A linked list entry uses 40 bytes in VC++, a simplified
version using my allocator uses just 8 bytes (including allocation
overheads - it's never necessary to store any meta data). You can
make up your own mind, but I prefer to avoid nasty surprises like
above at a late stage in a project!
Wilco |
|
| Back to top |
|
 |
Peter Dimov
Guest
|
Posted:
Thu Oct 20, 2005 4:15 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Wilco Dijkstra wrote:
| Quote: | String
concatenation using STL is around 30-50 times slower than the equivalent
C for example - mostly due to the overhead of new/delete.
|
This is off-topic in all three groups, but I just can't let it slide.
If by "equivalent C" you mean in-place strcat into a fixed size buffer,
then no, the equivalent std::string operation (s1 += s2, where s1 has
enough capacity to hold the result) does neither new nor delete and is
nowhere near 30-50 times slower. |
|
| Back to top |
|
 |
|
|
|
|