2009-06-29

Developing parallel applications with Open CASCADE. Part 3

( continued...)

Data races
Data races is one of the most frequent problems that pop up when transitioning from sequential to parallel programming. When you develop a sequential code you usually make totally different assumptions on how you code would execute. It may require some imagination to understand how your code may work in multi-threaded environment and to anticipate which problems may happen.

Data races may result in really unpredictable application behavior – from sometimes correct results to non-reproducible incorrect results, and to spontaneous crashes. If you observe such behavior it may likely be due to a data race in your code or in 3rd party library. To catch it you can use Intel Parallel Inspector which is able to identify different memory and thread errors. Here is how it looks when catching an error in Open CASCADE:



Open CASCADE contains quite a lot of data races prone places if used in parallel applications as is. I first mentioned about data races in OpenCASCADE when experimented with IGES translation last year (search for forum threads on that). Working on ACIS importer revealed a few more. Most frequent examples are the following:

a). return of const& to a static variable inside the function. For instance:
const gp_Dir& gp::DX()
{
static gp_Dir gp_DX(1,0,0);
return gp_DX;
}

This works perfectly fine in a single-threaded application but may create a problem in a parallel one. When two threads simultaneously call gp::DX() for the first time, there is concurrent write-access to gp_DX variable and result is unpredictable. The fix in this case is to put definition outside of the function and thus to initialize the variable during load time:

static gp_Dir gp_DX(1,0,0);
const gp_Dir& gp::DX()
{
return gp_DX;
}

b). Use of static variables defined in a file.
These cases range from forced use when a variable cannot be part of an argument list or a class member to ugly usages that I can only attribute to non-understanding of C++. Fixes for these cases were symmetrical – from using Standard_Mutex whenever it was unavoidable to creating auto variables on a stack.

The example of the first case was GeomConvert_ApproxCurve.cxx which defines a C-style function that must use data outside of it (e.g. as a member of a class that calls it). The API of this function is prescribed elsewhere and itself is specified as an argument to some class.

static Handle(Adaptor3d_HCurve) fonct = NULL;
static Standard_Real StartEndSav[2];

extern "C" void myEval3d(Standard_Integer * Dimension,
// Dimension
Standard_Real * StartEnd,
// StartEnd[2]
Standard_Real * Param,
// Parameter at which evaluation
Standard_Integer * Order,
// Derivative Request
Standard_Real * Result,
// Result[Dimension]
Standard_Integer * ErrorCode)
// Error Code
{
...
StartEndSav[0]=StartEnd[0];
StartEndSav[1]=StartEnd[1];
...
}

In this case I had to protect fonct and StartEndSav with Standard_Mutex in the caller class to prevent data race. Reading Open CASCADE 6.3.1 Release Notes, I noticed that this issue was addressed and the OCC team made deeper changes introducing a class (not a C function) what allows to handle data transfer in a way not involving synchronization. This is good and I will be able to roll back my fixes.

c). Special case of BSplCLib, BSplSLib and PLib.

This case was already discussed in the blog last December in context of speeding up Boolean Operations. Let me briefly remind you the issue. For B-Splines calculations, version 6.3.0 (and prior) uses an auxiliary array of doubles allocated on heap; it is reallocated if new requested size is greater than previously allocated or is simply reused if already allocated buffer suffices. This is done in order to avoid frequent allocation/deallocation as this creates a bottleneck but obviously is not thread-safe.

So discussing then, Andrey Betenev suggested using auto variables (allocated on stack) as there is a limitation of a maximum B-Spline degree. Digging the code I realized that though in 99+% cases this will work, there can still be cases when any maximum threshold can be exceeded. This is because the number of poles in B-Splines is unlimited (and there are curves featuring more than 8192 poles!). So I implemented an improved alternative that use auto-buffers if the requested size is less than some threshold and uses heap otherwise.

template class BSplCLib_DataBuffer
{
public:

//8K * sizeof (double) = 64K
static const size_t MAX_ARRAY_SIZE = 8192;

BSplCLib_DataBuffer (const size_t theSize) : myHeap (0), myP (myAuto)
{
Allocate(theSize);
}

BSplCLib_DataBuffer () : myHeap (0), myP (myAuto) {}

virtual ~BSplCLib_DataBuffer()
{ Deallocate(); }

void Allocate (const size_t theSize)
{
Deallocate();
if (theSize > MAX_ARRAY_SIZE)
myP = myHeap = new T [theSize];
else
myP = myAuto;
}

operator T* () const
{ return myP; }


protected:

BSplCLib_DataBuffer (const BSplCLib_DataBuffer&) : myHeap (0), myP (myAuto) {}
BSplCLib_DataBuffer& operator= (const BSplCLib_DataBuffer&) {}

void Deallocate()
{
if (myHeap) {
delete myHeap;
myHeap = myP = 0;
}
}

T myAuto [MAX_ARRAY_SIZE];
T* myHeap;
T* myP;
};

(to be continued...)

2009-06-23

Developing parallel applications with Open CASCADE. Part 2

(continued...)
By the way, you must set this environment variable (MMGT_OPT) as well as other ones that control Open CASCADE memory management, before your application starts. This is important as selection of a memory manager is done during dll's loading. This is also a very inconvenient limitation that forbids to assign a custom memory manager in run-time.

Due to extensive memory allocation and deallocation during app life-cycle, it may become a hotspot. Look for instance, at the following screenshot received on a CAD Exchanger ACIS translator.



When measuring concurrency and waits & locks (the other two analysis types offered by Amplifier) using direct OS memory manager (as Open CASCADE one could not be used as explained in a previous post), I noticed that it also causes the greatest wait time.



On one hand, this is a very good indication that the rest of the code runs pretty fast. On the other, it indicates that memory management can really become a bottleneck. Analyzing the problem deeper I switched to the mode to see direct OS functions (toggling off the button on Amplifier toolbar) and here is what I saw:



What does it show to us ? That 2 system functions – RtlpFindAndCommitPages() and ZwWaitForSingleObject() – are hotspots and stack unwinding shows that they are called from memory allocation / deallocation routines !

The first hotspot is explained by the fact that ACIS translator creates multiple tiny objects containing results of translation (on this particular test case – 22000+). This causes strong memory fragmentation which forces the system to constantly look for new memory chunks.

The second (which goes through critical section) is caused by the default mechanism of memory management on Windows. As you might know, heap allocation in Windows is done sequentially using a mutex (critical section) with spin count of ~4000, i.e. when one thread requests memory allocation and in parallel another one tries to do the same, this latter thread does not go immediately to sleep mode but 'spins for 4000 times letting the former one to complete.

All this is caused by the direct use of calloc/malloc/free, and new/delete. To overcome this issue I have tried a technique offered by Intel Threading Building Blocks 2.2 which allows to substitute all calls to C/C++ memory management with calls to tbb allocator. This is done extremely easy with including a simple statement

#include "tbb/tbbmalloc_proxy.h"

The TBB allocator runs concurrently (without locks inside) and also works in a way similar to Open CASCADE – reuses once allocated blocks without returning them to the OS. This solved both hotspots issue and gave additional speed! Check out comparison of results:



On a side note, notice that on previous images there are not only calls via Standard_MMgrRaw::Allocate() and ::Free() (which are called via Standard::Allocate() and ::Free() when MMGT_OPT=0). There are also direct calls from arrays (TColgp_Array1OfPnt2d, etc) and others (highlighted in violet). They correspond to calls of new[] operator which is not redefined in Open CASCADE classes and therefore bypass Open CASCADE memory manager. Andrey Betenev once pointed this out to me, and here are the clear evidence of this observation. So, this should lead to an action on OCC team side, if they want to fix this oversight.

So, the summary are:
- default (optimized) Open CASCADE memory manager (MMGT_OPT=1) is not applicable to the parallel apps
- raw memory manager (MMGT_OP=0) forwarding calls to OS malloc/free can lead to fragementations and bottlenecks on workloads extensively using memory;
- to overcome this you need special memory manager, for instance Intel TBB that substitutes OS memory management routines.

(to be continued...)

2009-06-20

Performance tests of CAD Exchanger ACIS translator

As I already mentioned, I have been developing an ACIS translator for CAD Exchanger designing its architecture for multi-core from the very beginning. Today I performed a series of performance tests of its Alpha version executing multiple-files (from 150Kb to 55Mb) on different hardware (2/4/8 cores). The same set of tests have been executed against another ACIS translator and I am proud to state that mine has fully outperformed it ! The lowest gain was about 15% on a smaller file and the highest gain was 10x on the biggest one (it took CAD Exchanger ACIS translator only 23 seconds while the competitor needed 230seconds to translate a model of 55Mb). Obviously the latter gain will be most noticeable for the user.

I made some video running a live test of one model to demonstrate parallelism on an Intel Core i7 workstation. Sorry for sound artifacts – I was simply using my laptop for video recording, so you may notice hard disk crunching :-(.



There are several phases in the ACIS translator – parsing a file, creating a transient file model in memory, converting it into Open CASCADE shape, and eventually shape analysis and correction. Conversion and correction which take the largest share of time are fully concurrent and scale very well in most cases. Parsing a file is obviously a single-threaded task. And a phase of creating a transient model, even though architecture allows to make it fully parallel I had to execute sequentially being a hostage of Microsoft STL bug mentioned in an earlier post. With growing number of cores this phase becomes considerable comparing to others thus making the whole translation worse scalable. Hope that Microsoft will fix it earlier than I decide to implement own string parsing ;-).

The code takes advantage of Intel Threading Building Blocks which I loved for its ease of use, excellent scalability and very low overhead (I did not notice much difference running TBB with only one thread vs simple sequential execution). There are also a couple of know-how architectural algorithms that made this concurrency possible, which I am quite proud of;-).

Development of the ACIS translator will continue and should make it into the CAD Exchanger GUI eventually. It will also be offered as a C++ software library that you could use in your Open CASCADE-based application to read ACIS-SAT files. By the way, I was impressed to find hundreds of ACIS files all over the internet which really underlines its popularity.
If you would like to evaluate it please let me know.

Thanks !

2009-06-09

Developing parallel applications with Open CASCADE. Part 1

As mentioned in an earlier post, I am now working on an ACIS importer for CAD Exchanger and am developing it to be parallel. The results are very promising so far (except STL streams parsing due to the Microsoft bug mentioned in that post, though I'm now installing VS2008SP1 to check if it has been fixed). So I decided to share my experience in this blog and hope it will be useful for other developers (questions on the forum about parallelism appear more frequently now). Also hope that Open CASCADE team will benefit from these findings.

As a quick note already made several time, let me underline that in multi-core era parallel applications will become mainstream in some mid-term and you better prepare yourself for that trend now. This can be good for your career path and these skills can be among your competitive advantages. Recently released Intel Parallel Studio eases work of developers to develop and to debug multi-threaded applications (it has become a vital part of my toolbox). There are good basic books on this subject. I am now reading "Patterns for Parallel Programming" by Timothy Mattson et al (recommended by one my skilled colleague), an analogue of famous "Design Patterns" by Erich Gamma (the must read book for any professional software developer). It helped me shape the architecture of the ACIS importer for CAD Exchanger.

Getting back to Open CASCADE, I can definitively say that it is quite usable for parallel applications but like any other software library requires precautions.

General comments.
You will get major gain if you look at your problem from a higher level, from the problem domain, and not from a particular algorithm perspective. Identify what can be done concurrently and what requires sequential execution. For instance, in my initial experiments with IGES translation (see here) I intentionally focused on parallelizing translation of components of IGES groups and this worked reasonably well. But overall gain for entire translation was very modest because it was not this part that took most part – it was Shape Healing that ate about 50%-70% of a total time, and it remained sequential. So, until you restructure your algorithms your improvements will be limited. Working on the ACIS reader I had to completely redesign traditional architecture of waterfall approach (when a model is traversed from a root entity down to a leaf) and consequent shape healing. This gave tremendous outcome. The Patterns book gives concrete recommendations that help identify ‘patterns' in your algorithms.

Handles
With 6.2.x Open CASCADE handles (hierarchy of Handle_Standard_Transient subclasses) come with thread-safe implementation for reference counting. To take advantage of that you must either define MMGT_REENTRANT system variable to non-null or call Standard::SetReentrant (Standard_True) before you start using handles across threads. This makes reference counter to rely on atomic increment/decrement (e.g. InterlockedIncrement() on Windows) instead of ++ which can be not atomic.

void Handle(Standard_Transient)::BeginScope()
{
if (entity != UndefinedHandleAddress)
{
if ( Standard::IsReentrant() )
Standard_Atomic_Increment (&entity->count);
else
entity->count++;
}
}


Memory management
By default, Open CASCADE ships with MMGT_OPT variable defined to 1. This forwards all Standard::Allocate() and ::Free() calls to the Open CASCADE memory manager (Standard_MMgrOpt) which optimizes memory allocation mitigating memory fragmentations. (Probably it deserves a separate post to describe insights of the memory manager.)

Standard_MMgrOpt is thread-safe itself and safely regulates simultaneous memory allocation/deallocation requests from several threads. However it is based on Standard_Mutex (more on it in the future) which in its current implementation introduces extreme overhead and makes this optimized memory management worthless in parallel applications (while in a single threaded environment it works just fine).

So, to overcome this deficiency you should use MMGT_OPT=0. It activates Standard_MMgrRaw that simply forward calls to malloc/free…

(to be continued...)