2009-07-30

Scalability issues in CAD data exchange

(This is a re-post from Intel Software Network)

Continuing to eat our own dog food (i.e., to use Intel tools ourselves), let me share some recent practical experience.

I was measuring the parallelized part of the ACIS-SAT converter in CAD Exchanger (my personal project) to investigate its scalability. (SAT is the file format of ACIS, a well-known 3D modeling kernel.) The approach was to execute the same test case on the same architecture (e.g. Core 2 Duo or Core i7) while changing the number of threads. Obviously, the ideal curve is 1/n, where n is the number of cores. Below is the data I collected on a few files:



It includes measurements made on the Core-based Xeon 5300 (Clovertown) and the Core i7-based Xeon 3500 (Nehalem). Both are 4-core CPUs; the Nehalem part also features HT (hyper-threading). As you can see, Core i7 outpaces Core (each dashed curve is below the solid one), but the shape of the curve is the same for any file – it scales well from 1 to 2 to 4 threads but then becomes almost flat (except maybe the pink one, which corresponds to a rather small ACIS file).

So why is that? Discussing this issue with my colleagues at Intel, the first guess was that the application consumes all the bus bandwidth trying to load data from memory. That is, though the cores are free to execute, the application cannot be fed with data. We used another experimental Intel tool (Intel Performance Tuning Utility, available for free download at http://whatif.intel.com) to check this. It did not prove to be true – the experiments showed that though there are a lot of objects in memory, they are allocated quite locally and fit into the Level 2 cache, so there is no bottleneck in working with RAM.

So what else, then? Another guess was HT: perhaps the OS does not distribute the workload among the cores optimally, so that work running on the same physical core (there are actually 4 physical cores) but in different hardware threads starts competing for resources. So I took another machine with 2 processors (i.e. 8 physical cores) with HT disabled. The data changed slightly – running in 5 threads gave a speedup over 4, but again, the time remained flat when increasing the number of threads from 5 to 8.

To dive deeper into what happens, I launched Intel Thread Profiler, which (along with VTune Performance Analyzer) is a predecessor of Parallel Amplifier but has a nice feature not yet available in Amplifier – the timeline view.

I only focused on the parallelized part (highlighted on the image below).



As we can see, 7 threads have already finished their work (dark green became light green) while one thread (in this case, thread 1) continued to work. So this was an obvious example of non-scalability – adding more threads won't help when only one is still working!

I execute parallel processing using tbb::parallel_for, which takes a grain size and a partitioner as parameters. They define how the work is split into chunks. To track how TBB does this, I used the profiling API to add marks to the timeline indicating the start and end of processing of each solid, along with their indexes.



We can see now that this time thread 5 was given the largest solids, with indexes 14 and 15. That is, the default values for grain size and the partitioning strategy resulted in one thread being given several complex solids in a row, which caused it to take longer to complete. Given that processing each solid involves multiple complex algorithms, it is safe to split the work into one solid at a time (the benefit outweighs the splitting overhead implied by TBB). Here is the new code:

//use simple_partitioner and a grain size of 1 to ensure individual processing
tbb::parallel_for (tbb::blocked_range<int> (1, aNbEntities + 1, 1),
                   ApplyPostProcess (this, theMap),
                   tbb::simple_partitioner() /*was: auto_partitioner()*/);

This immediately resulted in better performance! While one thread was processing a complex solid, the others were given those that remained. Of course, even in this case the total duration is bounded from below by the time needed to process the single longest solid. Moreover, even with such a simplified approach there is still room for improvement – e.g., the workload can be sorted in order of decreasing complexity. This ensures that the 'heaviest' solids are distributed and processed first, while 'lighter' solids are deferred, significantly reducing the chance of imbalance. Of course, this recommendation assumes that sorting is substantially faster than the processing itself.

So, I hope this post will make you think about how your workload is really parallelized, how you can use Intel tools to see it, and how you can manage and improve it.

2009-07-21

CAD Exchanger Beta Update 1 is available

The CAD Exchanger product team is pleased to announce availability of the Beta Update 1 release of a new product to speed up translation of 3D data across multiple CAD formats.

Beta Update 1 can be freely downloaded from http://www.cadexchanger.com/download.html.

It addresses Beta users' feedback and introduces a few new features:
- New display mode – triangulation. Displays the underlying triangulation of surface models.
- Ability to change the precision of surface model triangulation. This parameter affects visualization quality and conversion of surface models into mesh formats (such as STL, VRML, etc.).
- VRML 2.0 (1997) export is now supported.
- BRep and STL exporters no longer export hidden models (i.e. those unchecked in the Model Explorer).
- Automatic checking for updates. Users will be automatically notified when new product releases become available.
- Skipping invisible (blank) entities when importing IGES files, at the user's discretion.

For a full list please refer to Release Notes.

CAD Exchanger currently supports import, view, and export of IGES, STEP, STL, VRML and BRep formats (more are coming).

We continue to look forward to users' feedback that will help us prioritize development activities. Please share your experience on the users' forum at http://www.cadexchanger.com/forum.


Sincerely,
The CAD Exchanger Team

2009-07-06

Developing parallel applications with Open CASCADE. Part 4

( continued...)

Open CASCADE threads
Open CASCADE provides a thread abstraction in the form of the OSD_Thread class, which encapsulates OS-specific threads. It accepts a variable of the OSD_ThreadFunction type, which is essentially a pointer to a function defined as follows:

typedef Standard_Address (*OSD_ThreadFunction) (Standard_Address data);

Here is a simple example of using Open CASCADE threads:

#include <iostream>

#include <OSD_Thread.hxx>
#include <Standard_Mutex.hxx>

static Standard_Mutex aMutex;

Standard_Address Test_ThreadFunction (Standard_Address /*theData*/)
{
  Standard_Mutex::Sentry aSentry (aMutex);
  std::cout << "Running in worker thread id:" << OSD_Thread::Current() << std::endl;
  return NULL;
}

int main (int argc, char *argv[])
{
  const int MAX_THREAD = 5;
  OSD_Thread aThreadArr[MAX_THREAD];

  std::cout << "Running in master thread id: " << OSD_Thread::Current() << std::endl;

  OSD_ThreadFunction aFunc = Test_ThreadFunction;
  int i;
  for (i = 0; i < MAX_THREAD; i++) {
    aThreadArr[i].SetFunction (aFunc);
  }
  for (i = 0; i < MAX_THREAD; i++) {
    if (!aThreadArr[i].Run (NULL))
      std::cerr << "Error: Cannot start thread " << i << std::endl;
  }

  for (i = 0; i < MAX_THREAD; i++) {
    Standard_Address aRes;
    if (!aThreadArr[i].Wait (aRes))
      std::cerr << "Error: Cannot get result of thread " << i << std::endl;
  }
  return 0;
}

The function passed to OSD_Thread::SetFunction() will be executed in a separate thread. Running the above example, you should see different IDs for the master and worker threads.

OSD_Thread::Run() accepts a parameter of type void*, which can be a pointer to some data object or a value cast from a compatible type (e.g. an integer). The same convention applies to the output parameter of OSD_Thread::Wait().

Of course, you are not obliged to use OSD_Thread and may prefer other implementations (system threads, Qt, Intel Threading Building Blocks, OpenMP, etc). I have become a big fan of Intel TBB and am using it in CAD Exchanger.

Synchronization objects
Currently Open CASCADE provides only one synchronization object, called Standard_Mutex, which corresponds to a Critical Section on Windows. You should gain some knowledge of this subject to use synchronization in your code effectively.

For instance, to minimize contention you should protect only the minimum required section of code with a synchronization object. Using Standard_Mutex::Sentry, you may create an additional nested scope for that:

void MyClass::MyMethodThatCanRunConcurrently()
{
  //thread-safe code
  ...
  {
    //code requiring serialization
    Standard_Mutex::Sentry aSentry (myMutex);
    MyNotConcurrentMethod();
  }

  //thread-safe code again
  ...
}

The most convenient and efficient way to use Standard_Mutex is through Standard_Mutex::Sentry, a scoped object that acquires and releases the lock on a mutex. The Sentry object locks the mutex during its construction and unlocks it during destruction. This guarantees that the mutex will be unlocked even if an exception is raised during execution. Otherwise a deadlock could occur – other threads would keep waiting for the locked mutex while there is no one left to unlock it.

It must be noted that Standard_Mutex is currently implemented inefficiently and can hardly be recommended for use as is. Its method Standard_Mutex::Lock() uses repeated calls to TryEnterCriticalSection() and Sleep(). This may waste processor resources instead of yielding them to another task. It seems the developers attempted to implement a spin lock, which can be useful for very short critical sections (where it is more efficient to spin than to immediately go into a wait state). However, this should have been done differently – by using a critical section with a spin count (InitializeCriticalSectionAndSpinCount() on Windows). I had to patch Standard_Mutex to eliminate this TryEnter...() behavior.

During my earlier experiments I implemented other synchronization objects – a wait condition and a semaphore – but as they did not become part of Open CASCADE, there is no point in describing them here.

(to be continued...)