By: Adrian (a.delete@this.acm.org), July 1, 2022 11:00 pm
Room: Moderated Discussions
⚛ (0xe2.0x9a.0x9b.delete@this.gmail.com) on July 1, 2022 4:01 am wrote:
>
> The worst-case scenario for the memory allocator you are describing IS NOT when first allocating
> 1_000_000 objects of the same size and then deallocating 999_999 of those objects. The reason
> for that is that the pattern remaining after deallocation of the 999_999 objects - which
> leaves only 1 object allocated - has a very low descriptive complexity.
>
> The worst-case scenario is: allocate 1_000_000 objects and then use an RNG (random number generator) to
> deallocate 500_000 randomly selected objects. The reason why this is the worst-case scenario is that the
> resulting sequence of 0 and 1 (zeroes and ones, or it can be interpreted as boolean values) is incompressible
> (assuming we want to avoid re-running the RNG every time a malloc() or free() is called to serve additional
> allocation/deallocation requests after the 500_000 deallocations finish executing).
No, maybe I have not explained this clearly, but that case, with 1 million objects allocated and then half of them, randomly chosen, being deallocated, is not the worst case. Moreover, it is not a bad case at all, but one in which such a memory allocator behaves perfectly, with no performance degradation.
The case you describe is precisely the worst one for a standard malloc implementation, because it is the most likely to cause heap fragmentation.
On the other hand, when the heap is implemented as a set of stacks containing the free memory blocks, with each stack (implemented as a linked list) dedicated to providing memory blocks of one fixed size, used to satisfy requests between two thresholds (e.g. 48-byte memory blocks for all the malloc calls with sizes from 33 to 48 bytes), it becomes completely irrelevant in which order the "free" calls are made and which allocations happened to be freed.
The case you describe is bad only when the malloc and free calls attempt to resize the memory blocks, either by splitting a larger block to better match the requested malloc size or by attempting to coalesce the freed block with a neighbor. Both kinds of resizing attempts require searches through the heap, searches that may be unsuccessful and whose chance of failure increases with the number of random "free" calls that have happened.
In contrast, when the heap is implemented with stacks, there are no searches through the heap and there is only one possible failure mode for "malloc": an empty stack for the requested size. That is solved by requesting one or more memory pages from the OS, carving free memory blocks out of them and pushing them onto the empty stack.
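To make the scheme concrete, here is a minimal sketch in C of what I mean by a heap made of per-size-class stacks. It is only an illustration, not my actual code: the four size classes, the 64 KiB refill granularity, the mmap()-based refill and the fact that free receives the size are all simplifications chosen for the example (a real implementation would find the class with LZCNT, as discussed below, and recover the size from the block itself).

    /* Minimal sketch of a per-size-class stack ("free list") allocator.
       Size classes, refill granularity and the mmap() path are assumptions
       chosen only for illustration; error handling is minimal. */

    #include <stddef.h>
    #include <sys/mman.h>

    #define NUM_CLASSES 4
    static const size_t class_size[NUM_CLASSES] = { 16, 32, 48, 64 };

    /* Each free block starts with a pointer to the next free block. */
    typedef struct free_block {
        struct free_block *next;
    } free_block;

    static free_block *stack_top[NUM_CLASSES]; /* one stack per size class */

    /* Map the requested size to a size class (linear scan here; the real
       allocator can do this with one LZCNT, as described further below). */
    static int size_to_class(size_t n)
    {
        for (int c = 0; c < NUM_CLASSES; c++)
            if (n <= class_size[c])
                return c;
        return -1; /* larger requests would go through a different path */
    }

    /* Refill an empty stack: get pages from the OS, carve them into blocks
       of the class size and push every block onto the stack. */
    static int refill(int c)
    {
        const size_t chunk = 64 * 1024;
        char *mem = mmap(NULL, chunk, PROT_READ | PROT_WRITE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
        if (mem == MAP_FAILED)
            return -1;
        for (size_t off = 0; off + class_size[c] <= chunk; off += class_size[c]) {
            free_block *b = (free_block *)(mem + off);
            b->next = stack_top[c];
            stack_top[c] = b;
        }
        return 0;
    }

    void *stack_malloc(size_t n)
    {
        int c = size_to_class(n);
        if (c < 0)
            return NULL;
        if (!stack_top[c] && refill(c) != 0)  /* only failure mode: empty stack */
            return NULL;
        free_block *b = stack_top[c];         /* pop: one link manipulation */
        stack_top[c] = b->next;
        return b;
    }

    void stack_free(void *p, size_t n)        /* size passed only for simplicity */
    {
        int c = size_to_class(n);
        if (c < 0)
            return;
        free_block *b = p;
        b->next = stack_top[c];               /* push: one link manipulation */
        stack_top[c] = b;
    }

Note how neither path ever scans the heap or looks at neighboring blocks, which is why the random deallocation pattern is irrelevant to it.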
>
> Basically, you are claiming that if a program has for example reached a point/state
> in which it has the following counts and sizes of memory allocated (COUNT*SIZE):
>
> 4*AA 3*BBBB 1*CCCCCCCC
>
> then the memory layout (=L1):
>
> AA .... AA .... AA .... AA ....
>
> BBBB .... BBBB .... BBBB ....
>
> CCCCCCCC ....
>
> somehow results in lower _total_ memory consumption compared to a memory layout (=L2) such as:
>
> AA .... BBBB .... AA .... BBBB .... AA .... CCCCCCCC .... BBBB .... AA ....
>
> There exist many "...." regions in both cases. So, please post a sufficiently thorough explanation
> why you believe that the combined/total memory usage of those "...." is lower in the memory
> allocator layout L1 you are defending here and is higher in allocator L2.
>
> The main advantage of L1 over L2 isn't lower memory fragmentation - the main advantage
> is that L1 can fulfill a request in O(1) time, but L2 requires O(log(N)) time.
>
> -atom
First of all, I have not defended the malloc implementation where the heap is a set of stacks in general. I have repeatedly stressed that this style of allocator is the best for long-lived processes, i.e. daemons/services, where it guarantees no performance degradation even after years of uptime, but that it is definitely not the best kind of malloc implementation in other cases.
There is no doubt that the total memory usage is worse for the stack-based allocator and this is its main disadvantage.
The higher memory usage is mainly due to the allocated blocks being larger than the requested sizes in many cases, though this can be tuned for specific applications by choosing the size thresholds so that the most frequent allocations carry no such overhead. The size of each memory block is also increased by one pointer, needed to link it into the stack, but while the block is allocated that pointer can point to a descriptor holding other information, such as the size, which most other malloc algorithms also need to store. So the real cost may be just an extra 4 bytes versus the other malloc algorithms, i.e. not all 8 bytes of the pointer are additional overhead.
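One way to picture that per-block word (again only a hypothetical layout with made-up names, not my actual structures):

    #include <stddef.h>

    struct block_descriptor {
        size_t size;          /* size class of the block */
        /* ... any other metadata most allocators also keep ... */
    };

    /* The same 8-byte word at the start of a block is a free-list link
       while the block is free, and a descriptor pointer while allocated. */
    union block_header {
        union block_header      *next_free;   /* while the block is free   */
        struct block_descriptor *descriptor;  /* while the block is in use */
    };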
The "malloc" and "free" times for the stack-based allocator are almost constant for most invocations, i.e. almost O(1), consisting of making a push or a pop in a stack (i.e. only one link manipulation), and the stack that must be used is found in constant time by one LZCNT instruction when there is one stack per octave of malloc requested size. An extra stack per octave needs just one more bit test instruction.
A "malloc" can occasionally request memory from the OS, but this becomes more and more rare after startup. "free" (or "malloc" or both, depending on an implementation choice) might need very seldomly a long time, only if the allocator is written to make attempts to detect the case when no allocation requests have been made in a long time with certain sizes and there is much memory in the stacks that could be returned to the OS.
So regarding execution times, no other kind of "malloc" can do better than this. Therefore the choice between the other allocators and the stack-based heap implementation is mainly a choice between efficient use of heap memory on one hand and, on the other, a guarantee of no performance degradation over time together with optimal malloc/free times.