By: rwessel (rwessel.delete@this.yahoo.com), July 2, 2022 7:16 am
Room: Moderated Discussions
Adrian (a.delete@this.acm.org) on July 1, 2022 11:00 pm wrote:
> ⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on July 1, 2022 4:01 am wrote:
> >
> > The worst-case scenario for the memory allocator you are describing IS NOT when first allocating
> > 1_000_000 objects of the same size and then deallocating 999_999 of those objects. The reason
> > for that is that the pattern remaining after deallocation of the 999_999 objects - which
> > leaves only 1 object allocated - has a very low descriptive complexity.
> >
> > The worst-case scenario is: allocate 1_000_000 objects and then use an RNG (random number generator) to
> > deallocate 500_000 randomly selected objects. The reason why this is the worst-case scenario is that the
> > resulting sequence of 0 and 1 (zeroes and ones, or it can
> > be interpreted as boolean values) is incompressible
> > (assuming we want to avoid re-running the RNG every time a malloc() or free() is called to serve additional
> > allocation/deallocation requests after the 500_000 deallocations finish executing).
>
>
> No, maybe I have not explained clearly, but this case, with 1 million allocated objects then half of
> them, randomly chosen, being deallocated, is not the worst case. Moreover, it is not a bad case at
> all, but one in which such a memory allocator behaves perfectly, with no performance degradation.
>
> The case described by you is precisely the one that is the worst for a standard
> malloc implementation, being the most likely to cause heap fragmentation.
>
> On the other hand, when the heap is implemented as a set of stacks containing the free memory blocks,
> each stack (implemented as a linked list) being dedicated to provide memory blocks of a fixed size,
> which are used to satisfy memory requests between 2 thresholds, e.g. 48-byte memory blocks for satisfying
> all the malloc calls with sizes from 33 to 48 bytes, then it becomes completely irrelevant in which
> order the "free" calls are done and which happened to be the freed allocations.
>
> The case described by you is bad only when the malloc and free calls attempt to resize the memory blocks, by
> dividing a larger block to better match the requested malloc size, or attempting to coalesce the freed block
> with another. Both these kinds of resizing attempts require searches through the heap, searches that may be unsuccessful
> and whose chances of failure increase with the number of random "free" calls that have happened.
>
> In contrast, when the heap is implemented with stacks, there are no searches through
> the heap and there is only one possible failure mode for "malloc": stack empty for
> the requested size. That is solved by requesting one or more memory pages from the
> OS, making free memory blocks out of them and pushing them in the empty stack.
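
If I'm reading that right, a minimal sketch of such an allocator might look something like the following - the class sizes, the mmap()-based refill, and free() being handed the block size are all just illustrative choices on my part, not a claim about any particular implementation:

    /* One stack (singly linked list of free blocks) per size class. */
    #include <stddef.h>
    #include <sys/mman.h>

    #define PAGE_SIZE   4096u
    #define NUM_CLASSES 8
    /* Fixed block size per class, e.g. class 2 serves requests of 33..48 bytes. */
    static const size_t class_size[NUM_CLASSES] =
        { 16, 32, 48, 64, 96, 128, 192, 256 };

    struct free_block { struct free_block *next; };
    static struct free_block *stacks[NUM_CLASSES];   /* heads of the free lists */

    static int find_class(size_t size)               /* smallest class that fits */
    {
        for (int c = 0; c < NUM_CLASSES; c++)
            if (size <= class_size[c])
                return c;
        return -1;                                   /* too big for this sketch */
    }

    /* The only failure mode: the stack is empty.  Fix it by carving a fresh
       OS page into fixed-size blocks and pushing them all onto the stack. */
    static int refill(int c)
    {
        char *page = mmap(NULL, PAGE_SIZE, PROT_READ | PROT_WRITE,
                          MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (page == MAP_FAILED)
            return -1;
        for (size_t off = 0; off + class_size[c] <= PAGE_SIZE; off += class_size[c]) {
            struct free_block *b = (struct free_block *)(page + off);
            b->next = stacks[c];
            stacks[c] = b;
        }
        return 0;
    }

    void *my_malloc(size_t size)
    {
        int c = find_class(size);
        if (c < 0)
            return NULL;
        if (stacks[c] == NULL && refill(c) != 0)
            return NULL;                             /* OS refused more pages */
        struct free_block *b = stacks[c];            /* pop: O(1), no search  */
        stacks[c] = b->next;
        return b;
    }

    /* The caller supplies the size here only to keep the sketch short; a real
       allocator would recover it from a header or a per-page descriptor. */
    void my_free(void *p, size_t size)
    {
        int c = find_class(size);
        if (c < 0)
            return;
        struct free_block *b = p;
        b->next = stacks[c];                         /* push: the order of frees */
        stacks[c] = b;                               /* is completely irrelevant */
    }

No search through the heap anywhere, no splitting, no coalescing - which is the point.
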
>
>
>
>
>
> >
> > Basically, you are claiming that if a program has for example reached a point/state
> > in which it has the following counts and sizes of memory allocated (COUNT*SIZE):
> >
> > 4*AA 3*BBBB 1*CCCCCCCC
> >
> > then the memory layout (=L1):
> >
> > AA .... AA .... AA .... AA ....
> >
> > BBBB .... BBBB .... BBBB ....
> >
> > CCCCCCCC ....
> >
> > somehow results in lower _total_ memory consumption compared to a memory layout (=L2) such as:
> >
> > AA .... BBBB .... AA .... BBBB .... AA .... CCCCCCCC .... BBBB .... AA ....
> >
> > There exist many "...." regions in both cases. So, please post a sufficiently thorough explanation
> > why you believe that the combined/total memory usage of those "...." is lower in the memory
> > allocator layout L1 you are defending here and is higher in allocator L2.
> >
> > The main advantage of L1 over L2 isn't lower memory fragmentation - the main advantage
> > is that L1 can fulfill a request in O(1) time, but L2 requires O(log(N)) time.
> >
> > -atom
>
>
> First of all I have not defended the malloc implementation where the heap is implemented as a set
> of stacks in general, I have repeatedly stressed that this style of allocator is the best for long-lived
> processes, i.e. daemons/services, where it guarantees no performance degradation even after years
> of uptime, but it definitely is not the best kind of malloc implementation in other cases.
>
>
> There is no doubt that the total memory usage is worse for the
> stack-based allocator and this is its main disadvantage.
>
> The higher memory usage is mainly due to the allocated blocks being larger than the requested sizes in
> many cases, though this can be tuned for specific applications by choosing appropriate thresholds for
> the sizes so that the most frequent allocations will not need overhead. The size of each memory block
> is also increased by one pointer needed to allow linking into the stack, but that pointer can be used
> to point to a descriptor while the block is allocated, which can contain other information, like the
> size, which is also needed by most other malloc algorithms, so that may mean just an extra 4 bytes versus
> the other malloc algorithms, i.e. not all 8 bytes of the pointer are additional overhead.
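
A sketch of that header idea - the same word is the stack link while the block is free, and a pointer to a (hypothetical) descriptor holding the size and whatever else while the block is allocated:

    #include <stddef.h>

    struct block_desc {                 /* illustrative descriptor contents */
        size_t size;
        /* ... any other per-allocation metadata ... */
    };

    union block_header {                /* one pointer-sized word per block  */
        union block_header *next;       /* while the block sits on its stack */
        struct block_desc  *desc;       /* while the block is handed out     */
    };
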
>
>
> The "malloc" and "free" times for the stack-based allocator are almost constant for most invocations, i.e.
> almost O(1), consisting of making a push or a pop in a stack (i.e. only one link manipulation), and the
> stack that must be used is found in constant time by one LZCNT instruction when there is one stack per octave
> of malloc requested size. An extra stack per octave needs just one more bit test instruction.
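
For what it's worth, the class lookup could look roughly like this (the 16-byte minimum block and the exact class numbering are made up; __builtin_clzll typically compiles down to a single LZCNT or BSR on x86-64):

    #include <stddef.h>

    /* One stack per octave: class c serves requests up to 2^c bytes.
       Assumes size >= 1. */
    static inline unsigned size_class(size_t size)
    {
        if (size <= 16)
            return 4;                                       /* 16-byte minimum block */
        return 64 - (unsigned)__builtin_clzll(size - 1);    /* ceil(log2(size))      */
    }

    /* Two stacks per octave needs just one extra bit test: does the request
       fall in the lower or the upper half of its octave?  E.g. 33..48 vs 49..64. */
    static inline unsigned size_subclass(size_t size)
    {
        unsigned c = size_class(size);
        unsigned upper = (unsigned)((size - 1) >> (c - 2)) & 1u;
        return 2u * c + upper;
    }
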
>
> A "malloc" can occasionally request memory from the OS, but this becomes more and more rare after startup. "free"
> (or "malloc" or both, depending on an implementation choice) might need very seldomly a long time, only if the
> allocator is written to make attempts to detect the case when no allocation requests have been made in a long
> time with certain sizes and there is much memory in the stacks that could be returned to the OS.
>
>
> So regarding the execution times, no other kind of "malloc" can do better than this, therefore the choice
> between others and the stack-based heap implementation is mainly a choice between efficient use of heap memory
> versus guarantees of no performance degradation in time together with optimal times for malloc/free.
>
One problem with that sort of allocator is that you still "fragment" real/physical memory (and virtual address space, although that's less of an issue on 64-bit machines), since pages can't generally be returned to the OS. It's rare for every entry on a page to become free, and relatively hard* to detect when it happens, so the page itself can seldom be given back. That's not a problem if the usage of objects of size X is relatively constant over time, but once you allocate enough pages for a million of these objects, you're probably going to be stuck with a million objects' worth of pages - and real/physical memory - until the program ends, even if you free most of those individual objects.
*In terms of overhead - it certainly can be done.
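
FWIW, the cheap half of it is just a per-page count, something like the sketch below (the page size, the toy lookup table and all the names are illustrative). The expensive half, which I've left out, is what happens once the count says the page is entirely free: the page's blocks have to be pulled back out of the size-class stack, and with a singly linked list that means sweeping the whole list before the page can be handed back with munmap() or madvise(MADV_DONTNEED).

    #include <stdint.h>
    #include <stddef.h>

    #define PAGE_SIZE 4096u
    #define MAX_PAGES 1024

    struct page_desc {
        void    *page;            /* page base address                        */
        uint32_t blocks_total;    /* blocks carved out of this page by refill */
        uint32_t blocks_free;     /* how many of them are free right now      */
    };

    static struct page_desc pages[MAX_PAGES];    /* toy side table; a real one
                                                    would hash or use a radix tree */

    static struct page_desc *desc_for(void *p)
    {
        void *base = (void *)((uintptr_t)p & ~(uintptr_t)(PAGE_SIZE - 1));
        for (int i = 0; i < MAX_PAGES; i++)
            if (pages[i].page == base)
                return &pages[i];
        return NULL;
    }

    /* Called from free() after the block has been pushed back on its stack. */
    static void note_block_freed(void *p)
    {
        struct page_desc *d = desc_for(p);
        if (d && ++d->blocks_free == d->blocks_total) {
            /* Every block on this page is free again.  Only after the
               expensive sweep that unlinks them from the free list could
               the page actually be returned to the OS. */
        }
    }
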