Intel Parallel Studio launched!

Today is THE day! Intel has announced the availability of Parallel Studio, a suite of software tools that eases development of parallel applications. Working on it has been my best experience at Intel so far. I adore it!

Respected 3rd party announcements: Dr. Dobbs, SD Times

Parallel Studio home, featuring Open CASCADE team's testimonial:

"Intel® Parallel Inspector and Intel® Parallel Amplifier greatly simplified the task of finding hotspots and memory leaks. We were pleased with the 2X overall performance improvement and the elimination of several previously unidentified memory leaks."
– Vlad Romashko
Software Development Manager
OpenCascade S.A.S.


Parallelizing C++ streams

This post won't be directly about Open CASCADE, but it may be helpful to those who are looking to parallelize their applications. My strong personal belief is that multi-threading is unavoidable for long-term success, and one must be prepared to take steps in that direction (sooner rather than later). The era of free MHz is over; multi-threading is the only way to scale in the future.

So, I am now developing an ACIS translator for CAD Exchanger and am designing its architecture to employ multi-threading wherever reasonable. Keeping it efficient, streamlined, and lightweight forces me to think hard and to refactor the implementation several times, but it pays off.

One place where I wanted to integrate multi-threading was the conversion from the persistent representation of an ACIS file (basically a set of containers of ASCII strings read directly from the file) into the transient representation (a set of C++ objects representing ACIS types, with data fields and cross-references between each other). The approach is that the first pass creates empty transient objects, and the second pass retrieves their data from the persistent objects and sets up references using the already created placeholders. This allows each object to be processed fully independently of the others and thus makes an excellent case for multi-threading. First I wrote a sequential conversion; on a 15 MB file consisting of 110,000+ entities this mapping took ~3 seconds. This became the benchmark to beat.
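The two-pass scheme can be sketched roughly as follows. This is a simplified illustration and not the CAD Exchanger code: PEntity, TEntity, and MapTwoPass are hypothetical names, and a persistent record is reduced to a type name plus the indices of the entities it refers to.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <string>
#include <vector>

// Hypothetical persistent record: raw data as read from the file.
struct PEntity {
    std::string name;               // entity type name, e.g. "face"
    std::vector<std::size_t> refs;  // indices of the entities it refers to
};

// Hypothetical transient object with resolved cross-references.
struct TEntity {
    std::string name;
    std::vector<const TEntity*> refs;
};

std::vector<std::unique_ptr<TEntity>> MapTwoPass(const std::vector<PEntity>& thePersistent)
{
    // Pass 1: create an empty placeholder for every entity.
    std::vector<std::unique_ptr<TEntity>> aTransient;
    aTransient.reserve(thePersistent.size());
    for (std::size_t i = 0; i < thePersistent.size(); ++i)
        aTransient.push_back(std::make_unique<TEntity>());

    // Pass 2: fill each placeholder. Iteration i writes only to aTransient[i]
    // and merely takes the addresses of other placeholders, so the iterations
    // do not depend on each other and could be split across threads.
    for (std::size_t i = 0; i < thePersistent.size(); ++i) {
        const PEntity& aSource = thePersistent[i];
        TEntity& aTarget = *aTransient[i];
        aTarget.name = aSource.name;
        for (std::size_t aRef : aSource.refs) {
            assert(aRef < aTransient.size());
            aTarget.refs.push_back(aTransient[aRef].get());
        }
    }
    return aTransient;
}
```

Because every placeholder already exists after the first pass, a forward reference (an entity pointing to one that appears later in the file) needs no special handling in the second pass.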

To ease parsing of the persistent entities' strings I use C++ streams that wrap char* buffers, and operator>> to recognize doubles, integers, characters, etc. This makes the code concise and easy to understand. To enable parallelism I am using Intel Threading Building Blocks (TBB), a software library (available under commercial and open-source licenses) that helps with many common tasks in developing multi-threaded software, including parallel loops, concurrent data containers, synchronization objects, and so on. It has already won several software awards and is gaining recognition among a broad audience of developers worldwide. It is developed by the same team that develops Parallel Amplifier and Inspector (and now Advisor), where I currently work at Intel.
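To give an idea of what stream-based parsing of one record looks like, here is a minimal sketch. The record layout (a type name, an integer id, and three doubles) is made up for illustration and is not real ACIS syntax; PointRecord and ParsePointRecord are hypothetical names. The sketch uses std::istringstream, which copies the text, whereas the translator wraps the char* buffers directly.

```cpp
#include <sstream>
#include <string>

// Hypothetical record layout, not real ACIS syntax:
// <type-name> <integer id> <x> <y> <z>
struct PointRecord {
    std::string type;
    int id = 0;
    double x = 0.0, y = 0.0, z = 0.0;
};

// Returns true if all five fields were extracted successfully.
bool ParsePointRecord(const char* theBuffer, PointRecord& theRecord)
{
    std::istringstream aStream(theBuffer);
    // Each operator>> skips leading whitespace and converts one field.
    aStream >> theRecord.type >> theRecord.id
            >> theRecord.x >> theRecord.y >> theRecord.z;
    return !aStream.fail();
}
```

A call such as ParsePointRecord("point 42 1.5 -2 3e2", aRecord) fills all five fields, while a record with a non-numeric id makes the function return false.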

The code looked quite straightforward (sequential part is commented out):

/*! \class ApplyPaste
    \brief The ApplyPaste class is used for concurrent mapping of persistent representation into transient.

    This is a functor class supplied to Intel TBB tbb::parallel_for(). It
    uses ACISBase_RMapper to perform conversion on a range of entities that TBB
    will feed to it.
*/
class ApplyPaste {
public:

    //! Creates an object.
    /*! Stores \a theMapper and model entities for faster access.*/
    ApplyPaste (const Handle(ACISBase_RMapper)& theMapper) :
        myMap (theMapper->File()->Entities()),
        myMapper (theMapper)
    {
    }

    //! Performs mapping on a range of entities
    /*! Uses ACISBase_RMapper::Paste() to perform mapping.
        The range \a r is determined by TBB.
    */
    void operator() (const tbb::blocked_range<size_t>& r) const
    {
        const ACISBase_File::EntMap& aPEntMap = myMap;
        const Handle(ACISBase_RMapper)& aMapper = myMapper;
        Handle (ACISBase_ACISObject) aTarget;
        for (size_t i = r.begin(); i != r.end(); ++i) {
            const Handle(ACISBase_PEntity)& aSource = aPEntMap(i);
            aMapper->Paste (aSource, aTarget);
        }
    }

private:
    const ACISBase_File::EntMap& myMap;
    const Handle(ACISBase_RMapper)& myMapper;
};

/*! Uses either sequential or parallel implementation.
*/
void ACISBase_RMapper::Paste()
{
    boost::timer aTimer;
    Standard_Integer aNbEntities = myFile->NbEntities();

    //parallel
    tbb::parallel_for(tbb::blocked_range<size_t>(0, aNbEntities), ApplyPaste(this));

    //sequential
    //const ACISBase_File::EntMap& aPEntMap = myFile->Entities();
    //Handle (ACISBase_ACISObject) aTarget;
    //for (Standard_Integer i = 0; i < aNbEntities; i++) {
    //    const Handle(ACISBase_PEntity)& aSource = aPEntMap(i);
    //    Paste (aSource, aTarget);
    //}

    Standard_Real aSecElapsed = aTimer.elapsed();
    cout << "ACISBase_RMapper::Paste() execution elapsed time: " << aSecElapsed << " s" << endl;
}

How disappointed I was to get a new elapsed time of ... 17 seconds on my new Core 2 Duo laptop (instead of 3 seconds for the sequential code)! What the heck?! Obviously it could not be attributed to overhead caused by TBB itself, otherwise there would be no point in using it. But what then?

I immediately launched Intel Parallel Amplifier to see what was going wrong. Here is what I saw:

Unused CPU time (i.e. time when one or more cores were idle) was 33.8 s, which means that at least one core was not working. The hotspot tree showed a critical section (a synchronization object that regulates exclusive access to a shared resource) entered from the std::_Lockit::_Lockit() constructor, which itself was most often called from std::locale::facet::_Incref() or _Decref(). A mystery at first glance. So I rebuilt my app in debug mode, started the debugger, and everything became clear.

The root cause is a critical section used to protect a common locale object. Internally, operator>>() creates a basic_istream::sentry object on the stack. Its constructor calls (through another method) ios_base::getloc(), which returns a std::locale object by value and therefore invokes its copy constructor (see the code below). The copy constructor calls _Incref() to increment a reference counter, and that increment is guarded by a critical section.

locale __CLR_OR_THIS_CALL getloc() const
    {   // get locale
    return (*_Ploc);
    }

So all streams compete for the same locale object! Moreover, the critical section is created with a spin count of 0. That means that if one thread tries and fails to acquire the lock (enter the critical section) while another thread holds it, it immediately goes to sleep. When the lock is released, the thread is awakened. All of this is extremely expensive, and that is exactly where the overhead comes from! Had the spin count been non-zero, it would run much faster: the spin count determines the number of attempts to acquire the lock before the thread goes to sleep. For example, memory management routines (e.g. malloc()) use a spin count of 4000 or so, and this lets multi-threaded apps run efficiently even when allocating memory concurrently. Why not do the same for streams?
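The idea behind a spin count can be illustrated with a small portable sketch. This is not how the Microsoft runtime implements its locks (on Windows the corresponding API is InitializeCriticalSectionAndSpinCount); SpinThenBlockLock is a hypothetical class built on std::mutex purely to show the "try a number of times, then sleep" logic.

```cpp
#include <mutex>

// Hypothetical lock that retries up to theSpinCount times before blocking.
class SpinThenBlockLock {
public:
    explicit SpinThenBlockLock(unsigned theSpinCount) : mySpinCount(theSpinCount) {}

    void lock()
    {
        // Spin phase: cheap repeated attempts, the thread stays on the CPU.
        for (unsigned i = 0; i < mySpinCount; ++i) {
            if (myMutex.try_lock())
                return;     // acquired while spinning, no sleep needed
        }
        // Blocking phase: the thread may be put to sleep until the lock is free.
        // With a spin count of 0 this is the only path taken.
        myMutex.lock();
    }

    void unlock() { myMutex.unlock(); }

private:
    std::mutex myMutex;
    unsigned mySpinCount;
};
```

Since the class provides lock() and unlock(), it can be used with std::lock_guard<SpinThenBlockLock>. When the protected operation is as short as incrementing a reference counter, the lock is likely to become free within the spin phase, which avoids the costly sleep and wake-up.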

OK, I tried to persist and talked to colleagues. One of them gave me a link to http://msdn.microsoft.com/en-us/library/ms235505.aspx which, among other things, discusses thread-specific locales. This looked promising. But after experimenting and reading http://msdn.microsoft.com/en-us/library/ms235302(VS.80).aspx I found it to be of no help :-(. The fact of the matter is that only the C runtime locale can be made local to a thread, while C++ streams always use the global locale set with locale::global. Gosh! Microsoft! Why?!

So this is where I am now. I will keep searching, but if you have ever dealt with STL streams in multi-threaded environments or have merely heard about this issue, please let me know. I will appreciate any reference. Implementing my own string parser (to avoid STL streams) is currently the least preferred option, but it is not ruled out.
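For reference, one possible shape of such a hand-written parser is sketched below. It assumes that falling back to the C library functions std::strtol and std::strtod is acceptable; these consult the C runtime locale rather than a std::locale object, so they should stay clear of the facet reference counting described above. FieldReader is a hypothetical name, and this sketch has not been measured against the 3-second benchmark.

```cpp
#include <cstdlib>

// Hypothetical cursor over a null-terminated record buffer.
class FieldReader {
public:
    explicit FieldReader(const char* theBuffer) : myPos(theBuffer), myOk(true) {}

    // Reads the next integer field (leading whitespace is skipped by strtol).
    long ReadInteger()
    {
        char* anEnd = nullptr;
        const long aValue = std::strtol(myPos, &anEnd, 10);
        Advance(anEnd);
        return aValue;
    }

    // Reads the next real field; the decimal separator follows the C locale.
    double ReadReal()
    {
        char* anEnd = nullptr;
        const double aValue = std::strtod(myPos, &anEnd);
        Advance(anEnd);
        return aValue;
    }

    // False once any read has failed to find a number.
    bool IsOk() const { return myOk; }

private:
    void Advance(const char* theEnd)
    {
        if (theEnd == myPos)
            myOk = false;   // no characters consumed: not a number
        else
            myPos = theEnd;
    }

    const char* myPos;
    bool myOk;
};
```

Each FieldReader only touches its own buffer, so one instance per entity could be created inside the parallel loop without sharing any state between threads.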

Nonetheless, my experience with TBB has been extremely positive. I have integrated it into the part that converts the transient representation into Open CASCADE shapes. Parallel execution with tbb::concurrent_hash_map for storage outperforms the sequential implementation and should scale well to a larger number of cores. I'll keep you posted on how it's going.

Take care,


[off-topic] ¡Hola! – Back from Spain

It has been a long month since my previous post on this blog. As we reached the RTM (Release-To-Manufacturing) milestone with Parallel Studio, I decided to take a break, and my family and I spent a few days in Spain. This was my first time there and I would definitely like to return. We could not see everything we hoped to: swine flu adjusted our plans once we arrived :-(.

In Europe, Spain is known to be hit quite hard by the developing financial crisis. Though I deem the crisis to be for the good overall, I am very sympathetic to the people who might be affected by it. Unlike many optimists out there, I believe that much worse is still ahead of us.

What was good to see is that Spain is doing a lot of things that are the right answer to this downturn. It started with relatively low trip prices, which would not have seemed believable a year ago. To retain tourist traffic, the consulate issues a multi-entry Schengen visa valid for 180 days. For a reasonable price we got an excellent 3* hotel near Barcelona, one block from the sea, which surpassed some Egyptian 5* hotels. Upon arrival we were given full board though we had only paid for half board (HB), and the food was spectacular. As it's not yet high season (and we normally try to take advantage of this) and sellers are preparing for the worst, they are very open to offering discounts. As an example, Port Aventura, an amusement park, was offering a 50% discount, giving a 2-day ticket for the price of 1. My daughter was happy to enjoy this.

Consequences of the recent boom can still be observed. Like many other countries, Spain fell victim to a real estate bubble. On the main street of the tourist town where we stayed, there was a real estate agency every 100 m or even more often. More than half showed no signs of life during our entire stay. Some have just been abandoned. Halted construction sites, even in lovely locations, even almost completed, were frequent. There were sale ads on every (!) apartment building along the entire beach, sometimes 10+ on each. Natural consequences of the craziness. It's good that Spain started going through this earlier; it is yet to come in Russia.

Regarding sightseeing, we spent most of the days in Barcelona. Of course, we saw (almost) all the "must see" locations: the Gothic Castle, Sagrada Familia, the Gaudi buildings, Park Güell, La Rambla, and many other things. We were impressed by the Santa Maria del Mar church, which appeared so small from the outside and so monumental inside, with a big rose window above the entrance. Twice we failed to make it into the Picasso museum, so this remains at least one reason to return. On another day we went to Figueres, the home town of Salvador Dali, where he established a museum in the building of a once burnt-out theater. What a special man! His will was to be buried between two toilet bowls, and it was fulfilled: his grave is under two working toilets in the museum. As we were walking through the museum I could not say it was too impressive (of course, tiredness added to that), but back home I thought it over and concluded that you can't assess a big thing when you are near it; you have to step back to appreciate it.

So overall it was a great trip! If there are Spanish readers of this blog: you live in a great country with a glorious history, and you have every right to be proud of it!

As for me, I am back at work and we are now going full steam ahead toward new challenges and projects. I try to find some time to keep developing CAD Exchanger. There are some interesting findings in that development, so I hope to post some of them here.

Take care everyone.