2009-06-23

Developing parallel applications with Open CASCADE. Part 2

(continued...)
By the way, you must set this environment variable (MMGT_OPT), as well as the others that control Open CASCADE memory management, before your application starts. This is important because the memory manager is selected while the DLLs are being loaded. It is also a very inconvenient limitation, as it prevents assigning a custom memory manager at run time.

Because memory is allocated and deallocated extensively throughout the application life cycle, memory management may become a hotspot. Look, for instance, at the following screenshot obtained on the CAD Exchanger ACIS translator.



When measuring concurrency and waits & locks (the other two analysis types offered by Amplifier) with the direct OS memory manager (the Open CASCADE one could not be used, as explained in a previous post), I noticed that memory management also causes the greatest wait time.



On the one hand, this is a very good indication that the rest of the code runs pretty fast. On the other, it indicates that memory management can really become a bottleneck. To analyze the problem more deeply, I switched to the mode that shows direct OS functions (by toggling off the button on the Amplifier toolbar), and here is what I saw:



What does it show us? Two system functions, RtlpFindAndCommitPages() and ZwWaitForSingleObject(), are hotspots, and stack unwinding shows that they are called from memory allocation / deallocation routines!

The first hotspot is explained by the fact that the ACIS translator creates many tiny objects containing the results of translation (22000+ in this particular test case). This causes strong memory fragmentation, which forces the system to constantly look for new memory chunks.

The second (which goes through a critical section) is caused by the default mechanism of memory management on Windows. As you might know, heap allocation in Windows is serialized using a mutex (critical section) with a spin count of ~4000. That is, when one thread requests memory allocation and another one tries to do the same in parallel, the latter thread does not go to sleep immediately but spins up to ~4000 times, giving the former a chance to complete.

All this is caused by the direct use of calloc/malloc/free and new/delete. To overcome this issue, I tried a technique offered by Intel Threading Building Blocks 2.2 that allows substituting all calls to C/C++ memory management with calls to the TBB allocator. This is done extremely easily, by including a single statement:

#include "tbb/tbbmalloc_proxy.h"

The TBB allocator runs concurrently (without locks inside) and also works in a way similar to Open CASCADE: it reuses once-allocated blocks without returning them to the OS. This solved both hotspot issues and gave additional speed! Check out the comparison of results:



On a side note, notice that the previous images show not only calls via Standard_MMgrRaw::Allocate() and ::Free() (which are called via Standard::Allocate() and ::Free() when MMGT_OPT=0). There are also direct calls from arrays (TColgp_Array1OfPnt2d, etc.) and others (highlighted in violet). They correspond to calls of the new[] operator, which is not redefined in Open CASCADE classes, and therefore they bypass the Open CASCADE memory manager. Andrey Betenev once pointed this out to me, and here is clear evidence of this observation. So this should lead to an action on the OCC team's side if they want to fix this oversight.

So, the summary is:
- the default (optimized) Open CASCADE memory manager (MMGT_OPT=1) is not applicable to parallel apps;
- the raw memory manager (MMGT_OPT=0), which forwards calls to OS malloc/free, can lead to fragmentation and bottlenecks on workloads that use memory extensively;
- to overcome this you need a special memory manager, for instance Intel TBB, which substitutes the OS memory management routines.

(to be continued...)

4 comments:

  1. When you mention including "#include "tbb/tbbmalloc_proxy.h"" in the source file, can this be done in the application's stdafx.h, or does it need to be included in every file?

  2. Anonymous found the documentation and just needs to use the #include once in the main application. I did this but then I get a crash in debug mode with CrtIsValidPointer.

    This is caused by the destructor of std::vector<void*>. The crash is within _CrtIsValidHeapPointer. This suggests that some memory was allocated with one type of allocator in a different module, and the main module is then trying to free it.

    I know the above is a bit vague, but do you have any pointers as to what is going wrong?

  3. #include'ing tbbmalloc_proxy.h in a single cxx file is enough. What it basically does is link with tbbmalloc_proxy.dll, which (when loaded at run time) replaces the calls to new, malloc, delete, etc.

    Regarding _CrtIsValidHeapPointer, this is an interesting issue. The TBB allocator should be able to recognize whether the memory was allocated by it or by another (e.g. system) allocator. In the latter case it obviously does not try to free it. Which TBB version have you tried? We have recently released 3.0, which might help. Could you try it and/or send a small reproducer so that we can try this with our development team at Intel?

  4. I've upgraded to v3 and in release mode the application got a lot further before crashing. I'll investigate in debug mode to see if it's related to memory usage.
