| Author |
Message |
Wilco Dijkstra
Guest
|
Posted:
Sat Oct 22, 2005 12:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
"Oliver S." <Follow.Me@gmx.net> wrote in message
news:dj9dlt$mdp$1@domitilla.aioe.org...
| Quote: | Note the assumption that block sizes have a flat distribution is flawed,
I didn't assume that.
|
So how did you calculate 25% fragmentation?
| Quote: | 25% fragmentation means 33% more memory, ...
So what! Performance is often bought with memory-consumption.
|
That's incorrect in general. In special cases small lookup tables can indeed
improve performance by avoiding other penalties such as unpredictable
branches, however programs with large datasets (over say 50MBytes - ie.
many Windows programs) will slow down significantly due to unnecessary
memory consumption. In these cases it is essential to spend extra CPU
cycles on creating compact data structures rather than waste many more
cycles on cache/TLB/page misses. This includes the memory allocator
*investing* CPU cycles into reducing fragmentation and improving locality.
| Quote: | And I'm
not really sure that the fragmentation of my allocator is that worse.
With usual allocators there might be a lot of blocks which aren't usable
because they don't meet any allocation-requests.
|
All allocators have external fragmentation, but some are worse than others
due to the allocation and coalescing policies. How much does your allocator
have? When do you coalesce? You're not using the buddy system I hope?
<snipped stuff about large pages>
Thanks, I'll try running the large page benchmark. They can help but are
not a panacea as they create a new set of problems.
| Quote: | Using more memory than you need is always a cost, ...
We're talking about small blocks up to a page here! That's not the type
of cost with isn't worth to be mentioned on the applications my allocator
is targeted for.
|
It's matters precisely because they are small blocks. Average allocation
size in real applications is surprisingly small. Many allocations are small
structures, few are related to the size of the dataset and so average
blocksize grows slowly with total size. Conversely wastage often grows
faster than the dataset and increases over time.
Try opening&closing a large file (> 100 MBytes) in an editor and observe
memory use. Or run large programs for months on end...
| Quote: | ... so it is worth spending a little extra time to reduce fragmentation.
Why are you so sure that the internal fragementation I have (which is only
noteworthy in terms of cacheline tail-blend) is worse than the external
frag-
mentation of other allocators? Have you seen any invesigations on this?
|
Because you will also have external fragmentation like other allocators.
Are you saying yours has less external fragmentation?!?
Wilco |
|
| Back to top |
|
 |
David Hopwood
Guest
|
Posted:
Sat Oct 22, 2005 8:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Andy Glew wrote:
| Quote: | Eric Northup wrote:
Aligning up to a power-of-2 and then putting all the padding at the end
can cause pathological cache behavior on some architectures.
David Hopwood <david.nospam.hopwood@blueyonder.co.uk> writes:
This is the result of the hash function that maps an address to a cache
line being too simple, depending only on some bits of the address. But
it is quite trivial to make the hash function depend on all bits of the
address, at very small hardware cost. So why is it not considered the
responsibility of hardware designers to get this right, rather than
the responsibility of software to work around poorly designed caches?
I tend to agree with you, except that I am reasonably certain that
incorporating many more bits into the hash function for the closest in
or L1 cache would slow the L1 cache down dramatically - e.g. from 3
cycles to 4, or from 4 to 5.
* yes, just XORing in more address bits *will* cost that much more time
|
I will take your word for it. But I am surprised.
Does anyone know which common processors use only low bits of an address
(either physical or virtual) to determine the L1 cache line?
| Quote: | I think that it is more reasonable to consider more complete index
functions for the L2 cache accesses and further out.
|
Right.
--
David Hopwood <david.nospam.hopwood@blueyonder.co.uk> |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Sat Oct 22, 2005 8:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | * yes, just XORing in more address bits *will* cost that much more time
|
Sure that this will heighten a critical path so that this will cost an
additional clock-cycle (although I believe that XORing won't help at all)?
I think that it's very likely that this won't do that! |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Sat Oct 22, 2005 8:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | Note the assumption that block sizes have a flat distribution is flawed,
I didn't assume that.
So how did you calculate 25% fragmentation?
|
I did an assumption of a flat distribution in the different size-classes
and not an overall flat distribution. So I'm right when I object, that I
didn't assume a flat distribution because it's more likely that you meant
a flat overall-distribution.
| Quote: | So what! Performance is often bought with memory-consumption.
That's incorrect in general. In special cases small lookup tables can indeed
improve performance by avoiding other penalties such as unpredictable
branches, however programs with large datasets (over say 50MBytes - ie.
many Windows programs) will slow down significantly due to unnecessary
memory consumption. In these cases it is essential to spend extra CPU
cycles on creating compact data structures rather than waste many more
cycles on cache/TLB/page misses. This includes the memory allocator
*investing* CPU cycles into reducing fragmentation and improving locality.
And I'm not really sure that the fragmentation of my allocator is that
worse. With usual allocators there might be a lot of blocks which aren't
usable because they don't meet any allocation-requests.
All allocators have external fragmentation, ...
|
Yes, but in my case, this external fragmentation doesn't lead to blocks
that may stay unused because they're likely to not satisfy any allocation.
| Quote: | How much does your allocator have? When do you coalesce?
|
In my case, single pages have all the same block-size for small-blocks.
So there's no need for coalescion.
| Quote: | Thanks, I'll try running the large page benchmark. They can
help but are not a panacea as they create a new set of problems.
|
It's rather hard for the kernel to efficiently satisfy these large pages
by re-arraging the pages so that contignous and aligned regions in the
size of a large page are available. But this seems to work at least with
my OS. And I have a nice idea on this: A background thead with idle-pri-
ority of the kernel might do this re-arrangements (of course by doing
some priority-inversions after locking the appropriate kernel data-struc-
tures *g*).
| Quote: | We're talking about small blocks up to a page here! That's not the type
of cost with isn't worth to be mentioned on the applications my allocator
is targeted for.
It's matters precisely because they are small blocks. Average allocation
size in real applications is surprisingly small. Many allocations are small
structures, few are related to the size of the dataset and so average
blocksize grows slowly with total size. Conversely wastage often grows
faster than the dataset and increases over time.
|
You might be right on this; but that doesn't matter with actual memory-costs.
To end this discussion about peformance degradations due to internal fragmen-
tation, uneven distribution in cache-usage and so on, I had a simple idea on
how to incorporate this "problems" into a micro-benchmark. I'll simpy do a
worst-case test on this with very distant malloc()-/free()-pairs at random
sizes (up to a certain size, i.e. 128 bytes) and I'll do a std::memset on
each allocated block after allocation and thereby provoking all this effects
to the largest extent. I'm pretty sure that after all my allocator remains
more performant than most and maybe all others. |
|
| Back to top |
|
 |
Bill Todd
Guest
|
Posted:
Sat Oct 22, 2005 8:15 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Oliver S. wrote:
| Quote: | * yes, just XORing in more address bits *will* cost that much more time
Sure that this will heighten a critical path so that this will cost an
additional clock-cycle (although I believe that XORing won't help at all)?
I think that it's very likely that this won't do that!
|
Such uninformed opinion, plus a buck or so, may buy you a cup of coffee,
but not much else.
Andy designs processors for a living, and was a central architect for
the P6. What are your qualifications?
- bill |
|
| Back to top |
|
 |
Chris Thomasson
Guest
|
Posted:
Sat Oct 22, 2005 12:14 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | I'm currently not sure whether I should make this open-source without any
limita-
tions, patent the algorithms and leave it open-source and non-commercial
for non
-profit projects (I've got a date with a patent-attorney on thursday) or
get it a
completely commercial close-source component without any patented
algorithms.
|
Better check if any of the algorithms you are using are not already
patented!
;)
Do a patent feasibility study.
P.S.
It took me a long time to research and finally patent VZOOM; patent activity
in that area seems to be high. |
|
| Back to top |
|
 |
Chris Thomasson
Guest
|
Posted:
Sat Oct 22, 2005 12:30 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | Oliver S. wrote:
To end this discussion about peformance degradations due to internal
fragmen-
tation, uneven distribution in cache-usage and so on, I had a simple idea
on
how to incorporate this "problems" into a micro-benchmark. I'll simpy do a
worst-case test on this with very distant malloc()-/free()-pairs at random
sizes (up to a certain size, i.e. 128 bytes) and I'll do a std::memset on
each allocated block after allocation and thereby provoking all this
effects
to the largest extent. I'm pretty sure that after all my allocator remains
more performant than most and maybe all others.
|
I would suggest benchmarking against:
http://www.leapheap.com/
I also have a design that might do well against your implementation. It's
kind of based on this simple slab-allocator pseudo-code I posted here:
http://groups.google.com/group/comp.programming.threads/browse_thread/thread/8245f4b48591fc69/e3efa5628aad4a82?hl=en#e3efa5628aad4a82
This works when you apply the little bug fixes listed. |
|
| Back to top |
|
 |
Casper H.S. Dik
Guest
|
Posted:
Sat Oct 22, 2005 2:16 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
"Oliver S." <Follow.Me@gmx.net> writes:
| Quote: | To end this discussion about peformance degradations due to internal fragmen-
tation, uneven distribution in cache-usage and so on, I had a simple idea on
how to incorporate this "problems" into a micro-benchmark. I'll simpy do a
worst-case test on this with very distant malloc()-/free()-pairs at random
sizes (up to a certain size, i.e. 128 bytes) and I'll do a std::memset on
each allocated block after allocation and thereby provoking all this effects
to the largest extent. I'm pretty sure that after all my allocator remains
more performant than most and maybe all others.
|
On some systems, memset() might avoid poluting the cache and just spend
memory cycles; on other it will pollute the cache.
I would strongly suggest that you benchmark you allocator against others;
in my experience, people have an extremely poor intuition when it comes
to performance.
Casper
--
Expressed in this posting are my opinions. They are in no way related
to opinions held by my employer, Sun Microsystems.
Statements on Sun products included here are not gospel and may
be fiction rather than truth. |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Sat Oct 22, 2005 3:37 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | On some systems, memset() might avoid poluting the cache and
just spend memory cycles; on other it will pollute the cache.
|
Of course; but it's only a micro benchmark which will be run on
the approproate the OS. |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Sat Oct 22, 2005 3:39 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
LeapHeap is a plainly idiotic allocator.
| Quote: | I also have a design that might do well against your implementation.
It's kind of based on this simple slab-allocator pseudo-code I posted
here: http://groups.google.com/group/...
|
O what a tiny allocator. My code is about 1.600 lines with a lot of
inline member-functions. |
|
| Back to top |
|
 |
Andreas Krey
Guest
|
Posted:
Sat Oct 22, 2005 4:15 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
* Oliver 'I don't do attributions' S. (Follow.Me@gmx.net)
.....
| Quote: | O what a tiny allocator. My code is about 1.600 lines with a lot of
inline member-functions.
|
Which, of course, is per se not any indication of quality. Care to publish?
Andreas
--
np: 4'33 |
|
| Back to top |
|
 |
Oliver S.
Guest
|
Posted:
Sat Oct 22, 2005 4:15 pm Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | O what a tiny allocator. My code is about 1.600 lines with a lot of
inline member-functions.
Which, of course, is per se not any indication of quality.
|
Of course not.
Yes, I've reserved the domain mallocator.org for that. |
|
| Back to top |
|
 |
Chris Thomasson
Guest
|
Posted:
Sun Oct 23, 2005 4:58 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
| Quote: | Casper H.S. Dik wrote:
"Oliver S." <Follow.Me@gmx.net> writes:
I've developed a extremly performant general-purpose memory allocator
(it runs
currently under windows, but should be adapted for Unices in an hour).
I'm very
confident it will outperform HOARD and many others both in
singlethreaded as well
Hoard comes with a number of tests you could run but you should also
look at a number of other designs. Hoard certainly isn't
state-of-the-art.
Really? I'd like to hear of another scalable memory allocator that's
not only portable and fast but also avoids both memory blowup and
allocator-induced false sharing. The only ones I'm aware of that have
at least three of those characteristics (except portable) are
_derivatives of Hoard_: the mostly lock-free allocator by Dice &
Garthwaite (ISMM 02) which relies on OS support, and the completely
lock-free variant by Michael (PLDI 04). None are publicly available.
|
Humm... Do you mean there are patents out on them?
http://www.research.ibm.com/people/m/michael/pldi-2004.pdf
AFAICT, Hoard requires at least two atomic operations and storeload membars
for every "uncontended" allocation? IMHO, this is unnecessary overhead... |
|
| Back to top |
|
 |
Guest
|
Posted:
Sun Oct 23, 2005 5:22 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
Casper H.S. Dik wrote:
| Quote: | "Oliver S." <Follow.Me@gmx.net> writes:
I've developed a extremly performant general-purpose memory allocator (it runs
currently under windows, but should be adapted for Unices in an hour). I'm very
confident it will outperform HOARD and many others both in singlethreaded as well
Hoard comes with a number of tests you could run but you should also
look at a number of other designs. Hoard certainly isn't state-of-the-art.
|
Really? I'd like to hear of another scalable memory allocator that's
not only portable and fast but also avoids both memory blowup and
allocator-induced false sharing. The only ones I'm aware of that have
at least three of those characteristics (except portable) are
_derivatives of Hoard_: the mostly lock-free allocator by Dice &
Garthwaite (ISMM 02) which relies on OS support, and the completely
lock-free variant by Michael (PLDI 04). None are publicly available. I
gather that the Google allocator is pretty good, but to my knowledge,
it doesn't run on Windows.
It's actually tremendously easy to make a fast allocator that lacks one
of the characteristics mentioned above (especially the blowup [1] and
false sharing avoidance [2]). The challenge is addressing all of them
simultaneously.
regards,
-- emery (author of Hoard)
[1] Blowup is a special kind of excessive memory consumption that's
triggered by allocators that can't recycle all memory used across
threads.
[2] Allocators can induce false sharing by parceling out pieces of a
cache line to different processors, or allowing pieces to eventually be
split across processors.
--
Emery Berger
Assistant Professor
Dept. of Computer Science
University of Massachusetts, Amherst
www.cs.umass.edu/~emery |
|
| Back to top |
|
 |
Finn Schiermer Andersen
Guest
|
Posted:
Sun Oct 23, 2005 6:35 am Post subject:
Re: Micro-Benchmark for Memory-Allocators |
|
|
"Oliver S." <Follow.Me@gmx.net> writes:
| Quote: | Andy designs processors for a living, and was a central
architect for the P6.
Even experts sometimes say things that aren't 100% correct.
|
But this time, it most likely is. Primary cache lookups
are almost always on the critical path in microprocesser
designs. There are a lot of articles describing microprocessors
which can confirm that.
And my qualification: designing RISC microprocessors for
MIPS Technologies some years ago.
/Finn |
|
| Back to top |
|
 |
|
|
|
|