By the way, you must set this environment variable (MMGT_OPT), as well as the others that control Open CASCADE memory management, before your application starts. This is important because the memory manager is selected while the DLLs are being loaded. It is also a very inconvenient limitation, as it prevents assigning a custom memory manager at run-time.
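Since the value has to be in place before the process starts, it can help to report at startup what the process actually inherited. Below is a minimal sketch of such a check. The helper names `parse_mmgt_opt` and `current_mmgt_opt` are hypothetical and not part of Open CASCADE, and the default of 1 is taken from the summary at the end of this post. Open CASCADE reads the variable on its own; this code only inspects the environment.

```cpp
#include <cassert>
#include <cstdlib>

// Hypothetical helper: interprets the raw text of MMGT_OPT.
// A null pointer means the variable is not set at all.
int parse_mmgt_opt(const char* raw, int default_mode = 1) {
    if (raw == nullptr || *raw == '\0') return default_mode;  // unset or empty
    char* end = nullptr;
    long value = std::strtol(raw, &end, 10);
    assert(end != nullptr);                  // strtol always sets the end pointer
    if (*end != '\0') return default_mode;   // trailing garbage: not a plain number
    return static_cast<int>(value);
}

// Hypothetical helper: reports the mode the current process inherited.
int current_mmgt_opt() {
    return parse_mmgt_opt(std::getenv("MMGT_OPT"));
}
```

Calling `current_mmgt_opt()` at the very beginning of `main()` and logging the result makes it easy to spot a launch where the variable was forgotten.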
Because memory is allocated and deallocated extensively throughout the application life-cycle, memory management may become a hotspot. Look, for instance, at the following screenshot taken while profiling the CAD Exchanger ACIS translator.
When measuring concurrency and waits & locks (the other two analysis types offered by Amplifier) with the direct OS memory manager (the Open CASCADE one could not be used, as explained in a previous post), I noticed that memory management also causes the greatest wait time.
On the one hand, this is a very good indication that the rest of the code runs pretty fast. On the other hand, it shows that memory management can really become a bottleneck. To analyze the problem more deeply, I switched to the mode that shows direct OS functions (by toggling off the button on the Amplifier toolbar), and here is what I saw:
What does it show us? That two system functions, RtlpFindAndCommitPages() and ZwWaitForSingleObject(), are hotspots, and stack unwinding shows that they are called from the memory allocation / deallocation routines!
The first hotspot is explained by the fact that the ACIS translator creates a large number of tiny objects containing the results of translation (22000+ in this particular test case). This causes heavy memory fragmentation, which forces the system to constantly look for new memory chunks.
The second (which goes through a critical section) is caused by the default memory management mechanism on Windows. As you might know, heap allocation on Windows is serialized with a mutex (critical section) that has a spin count of ~4000. That is, when one thread requests a memory allocation and another one tries to do the same in parallel, the latter thread does not go to sleep immediately but spins up to 4000 times to let the former one complete.
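To make the spin-then-sleep behavior more concrete, here is a minimal sketch of such a lock in portable C++. It is only an illustration of the idea, not the Windows implementation: the class `SpinThenBlockLock` and the function `count_under_contention` are hypothetical names, and the default of 4000 simply mirrors the approximate figure quoted above.

```cpp
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical illustration of a lock that spins before it blocks.
class SpinThenBlockLock {
public:
    explicit SpinThenBlockLock(int spin_count = 4000) : spin_count_(spin_count) {}

    void lock() {
        // Phase 1: retry without sleeping, hoping the owner finishes quickly.
        for (int i = 0; i < spin_count_; ++i) {
            if (mutex_.try_lock()) return;
        }
        // Phase 2: spinning did not help, so block until the lock is released.
        mutex_.lock();
    }

    void unlock() { mutex_.unlock(); }

private:
    std::mutex mutex_;
    int spin_count_;
};

// Several threads increment one counter, each increment under the lock.
// Returns the final counter value.
long count_under_contention(int threads, int iterations) {
    assert(threads > 0 && iterations >= 0);
    SpinThenBlockLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t) {
        pool.emplace_back([&lock, &counter, iterations] {
            for (int i = 0; i < iterations; ++i) {
                std::lock_guard<SpinThenBlockLock> guard(lock);
                ++counter;
            }
        });
    }
    for (auto& th : pool) th.join();
    return counter;
}
```

Every thread that loses the race burns CPU cycles in phase 1 and then sleeps in phase 2, which is the kind of wait time that shows up when many threads allocate at once.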
All this is caused by the direct use of calloc/malloc/free and new/delete. To overcome this issue I tried a technique offered by Intel Threading Building Blocks 2.2, which allows you to substitute all calls to the C/C++ memory management routines with calls to the TBB allocator. This is done extremely easily by including a single statement:
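In TBB the header that performs this substitution is `tbb/tbbmalloc_proxy.h`, so the statement is presumably an include of that header in one of the application's source files. Below is a minimal sketch under that assumption. The macro `USE_TBB_MALLOC_PROXY`, the struct `TinyObject` and the function `allocate_tiny_objects` are hypothetical and serve only to make the example build with or without TBB; the workload merely imitates many threads creating tiny objects.

```cpp
#ifdef USE_TBB_MALLOC_PROXY        // hypothetical build flag, not defined by TBB
#include "tbb/tbbmalloc_proxy.h"   // redirects malloc/free and new/delete to the TBB allocator
#endif

#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical stand-in for a small translation result.
struct TinyObject {
    double x, y, z;
};

// Each thread allocates per_thread tiny objects on the heap.
// Returns the total number of objects created by all threads.
std::size_t allocate_tiny_objects(unsigned threads, std::size_t per_thread) {
    assert(threads > 0);
    std::vector<std::size_t> created(threads, 0);
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < threads; ++t) {
        pool.emplace_back([&created, t, per_thread] {
            std::vector<std::unique_ptr<TinyObject>> objects;
            objects.reserve(per_thread);
            for (std::size_t i = 0; i < per_thread; ++i) {
                objects.push_back(
                    std::unique_ptr<TinyObject>(new TinyObject{0.0, 0.0, 0.0}));
            }
            created[t] = objects.size();  // each thread writes only its own slot
        });
    }
    for (auto& th : pool) th.join();

    std::size_t total = 0;
    for (std::size_t c : created) total += c;
    return total;
}
```

The workload code is identical in both builds; only the include decides which allocator serves the `new` calls. The application also has to be linked against the TBB malloc proxy library for the substitution to take effect.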
The TBB allocator runs concurrently (without locks inside) and also works in a way similar to Open CASCADE: it reuses blocks it has already allocated instead of returning them to the OS. This resolved both hotspots and gave an additional speed-up! Check out the comparison of results:
On a side note, notice that the previous images show not only calls via Standard_MMgrRaw::Allocate() and ::Free() (which are called via Standard::Allocate() and ::Free() when MMGT_OPT=0). There are also direct calls from arrays (TColgp_Array1OfPnt2d, etc.) and others (highlighted in violet). They correspond to calls of operator new, which is not redefined in these Open CASCADE classes, and they therefore bypass the Open CASCADE memory manager. Andrey Betenev once pointed this out to me, and here is clear evidence of his observation. So, this should lead to an action on the OCC team side, if they want to fix this oversight.
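The difference between the two kinds of classes can be sketched as follows. This is an illustration only and does not use Open CASCADE code: `managed_allocate` and `managed_free` are hypothetical stand-ins for a library-wide memory manager, and `RoutedPoint` and `BypassingPoint` are made-up classes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>

// Hypothetical stand-ins for a library-wide memory manager.
static std::size_t g_managed_allocations = 0;

void* managed_allocate(std::size_t size) {
    ++g_managed_allocations;             // count every request that reaches the manager
    void* p = std::malloc(size);
    if (p == nullptr) throw std::bad_alloc();
    return p;
}

void managed_free(void* p) { std::free(p); }

std::size_t managed_allocation_count() { return g_managed_allocations; }

// Redefines operator new/delete, so heap instances go through the manager.
struct RoutedPoint {
    double x, y;
    static void* operator new(std::size_t size) { return managed_allocate(size); }
    static void operator delete(void* p) { managed_free(p); }
};

// Does not redefine operator new, so heap instances use the global
// operator new and never reach the manager.
struct BypassingPoint {
    double x, y;
};
```

Creating a `RoutedPoint` with `new` increases `managed_allocation_count()` by one, while creating a `BypassingPoint` leaves it unchanged, which is exactly the kind of call that appears outside the memory manager in a profiler.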
So, to summarize:
- the default (optimized) Open CASCADE memory manager (MMGT_OPT=1) is not applicable to parallel apps;
- the raw memory manager (MMGT_OPT=0), which forwards calls to the OS malloc/free, can lead to fragmentation and bottlenecks on workloads that use memory extensively;
- to overcome this you need a special memory manager, for instance Intel TBB, which substitutes the OS memory management routines.
(to be continued...)