Parallelizing C++ streams
This post won't be directly about Open CASCADE but it may be helpful to those who are looking forward to parallelizing their applications. My personal strong belief is that multi-threading is unavoidable for long-term success and one must be prepared to make steps into that direction (better sooner than later). Era of free MHz is over, multi-threading is the only choice to scale in the future.
So. I am now developing an ACIS translator for CAD Exchanger and am designing its architecture to employ multi-threading whenever reasonable. Keeping it efficient, streamlined and light-weight forces to think and to refactor implementation several times but it pays off.
One of the examples where I was looking to integrate multi-threading was a conversion between persistent representation of an ACIS file (barely with a set of containers of ascii strings directly read from the file) into transient representation (which is a set of C++ objects representing ACIS types with data fields and cross-references between each other). The approach is that at the first pass empty transient objects are created and at the second their data is retrieved from persistent objects and references are created using already created placeholders. This allows to process each object fully independently from the others and thus represents an excellent case for multi-threading. First I made a sequential conversion and on the file of 15Mb consisting of 110,000+ entities this mapping took ~3 seconds. This became a benchmark to compete with.
To ease parsing of persistent entities strings I use C++ streams that wrap char* buffers, and operator >> to recognize doubles, integers, characters, etc. The makes the code concise and easy to understand. To enable parallelism I am using Intel Threading Building Blocks, a software library (available under commercial and Open Source license) facilitating solving many frequent tasks met in development of multi-threaded software including parallel loops, concurrent data containers, synchronization objects and so on. It already won several software awards and gains recognition of broad developers audience world-wide. It is developed in the same team that develops Parallel Amplifier, Inspector (and now Advisor) where I currently am at Intel.
The code looked quite straightforward (sequential part is commented out):
/*! \class ApplyPaste
\brief The ApplyPaste class is used for concurrent mapping of persistent representation into transient.
This is a functor class supplied to Intel TBB tbb::parallel_for(). It
uses ACISBase_RMapper to perform conversion on a range of entities that TBB
will feed to it.
*/
class ApplyPaste {
public:
//! Creates an object.
/*! Stores \a theMapper and model entites for faster access.*/
ApplyPaste (const Handle(ACISBase_RMapper)& theMapper) : myMapper (theMapper),
myMap (theMapper->File()->Entities())
{
}
//! Performs mapping on a range of entities
/*! Uses ACISBase_RMapper::Paste() to perform mapping.
The range \a r is determined by TBB.
*/
void operator() (const tbb::blocked_range
{
const ACISBase_File::EntMap& aPEntMap = myMap;
const Handle(ACISBase_RMapper)& aMapper = myMapper;
Handle (ACISBase_ACISObject) aTarget;
for (size_t i = r.begin(); i != r.end(); ++i) {
const Handle(ACISBase_PEntity)& aSource = aPEntMap(i);
aMapper->Paste (aSource, aTarget);
}
}
private:
const ACISBase_File::EntMap& myMap;
const Handle(ACISBase_RMapper)& myMapper;
};
/*! Uses either consequential or parallel implementation.
*/
void ACISBase_RMapper::Paste()
{
boost::timer aTimer;
Standard_Integer aNbEntities = myFile->NbEntities();
//parallel
tbb::parallel_for(tbb::blocked_range
tbb::auto_partitioner());
//sequential
//const ACISBase_File::EntMap& aPEntMap = myFile->Entities();
//Handle (ACISBase_ACISObject) aTarget;
//for (Standard_Integer i = 0; i < aNbEntities; i++) {
// const Handle(ACISBase_PEntity)& aSource = aPEntMap(i);
// Paste (aSource, aTarget);
//}
Standard_Real aSecElapsed = aTimer.elapsed();
cout << "ACISBase_RMapper::Paste() execution elapsed time: " << aSecElapsed << " s" << endl;
}
How disappointed was I to get a new elapsed time of ... 17sec on my new Core 2 Duo laptop (instead of 3secs in sequential code) ! What the heck ?! Obviously it could not be attributed to overhead caused by tbb, otherwise there was no point in using it. But what then ?
I immediately launched Intel Parallel Amplifier to see what goes wrong. Here is what I saw:
Unused CPU time (i.e. when one or more cores were not working) was 33.8s i.e. at least one core did not work. Hotspot tree showed that there was some critical section (a synchronization object that regulates exclusive access to some shared resource) called from std::_Lockit::_Lockit() constructor which itself most of the times was called from std::locale::facet::_Incref() or _Decref(). Mystery at the first glance. So I rebuilt my app in debug mode and started debugger and what ? Everything became clear.
The root cause is a critical section used to protect a common locale object. operator >>() inside creates a basic_istream::sentry object on the stack. Its constructor calls (through another method) ios_base::locale() which returns a std::locale object by calling its copy constructor (see syntax below). The copy constructor calls Incref() to increment a reference counter. Incrementing reference counter is surrounded by a critical section.
locale __CLR_OR_THIS_CALL getloc() const
{ // get locale
return (*_Ploc);
}
So, all streams compete for the same locale object! Moreover, critical section is created with spin count = 0. That means if one thread tries and fails to acquire a lock (enter critical section) while another thread is using it, it immediately goes into a sleep mode. When the lock gets freed the thread gets awaken. But all this is extremely expensive and therefore it creates that very overhead ! Should spin count be a non-null then it would run much faster – the spin count determines amount of tries to acquire a lock before the thread goes to sleep. For example, memory management routines (e.g. malloc()) use spin count of 4000 or so, and this makes multi-threaded apps run effectively even concurrently allocating memory. Why not to do the same for streams ?
OK, I tried to persist and talked to colleagues around. One of them gave me a link to http://msdn.microsoft.com/en-us/library/ms235505.aspx which among the rest discusses thread-specific locale. This looked promising. But after experimenting and reading http://msdn.microsoft.com/en-us/library/ms235302(VS.80).aspx I found this of no help :-(. The matter of the fact is that only C runtime locale can be made local to threads while C++ streams always use global locale::global. Gosh ! Microsoft ! Why ?!
So this is here I am now. I will keep on searching but if you ever dealt with using STL streams in multi-threaded environments or barely heard about this please let me know. I will appreciate any reference. The option to implement an own string parser (to avoid STL streams) is currently the least preferred but eventually is not excluded.
Nonetheless, experience with TBB runs extremely positively. I have integrated it into conversion part which converts transient representation into Open CASCADE shapes. Parallel execution and using of tbb::concurrent_hash_map for storage outperforms sequential implementation and will scale well over larger number of cores. I'll keep you posted on how it's going.
Take care,
Roman
3 comments
More news on this issue.
ReplyDeleteSTL provides an API to set individual locale objects (std::basic_ios::imbue()). However this brought very insignificant improvement (~10% speed up vs slowed down performance). The reason for that is that default constructors keep calling a global locale object so the problem is still there.
My nearest colleagues were amazed by this ugly bug and urged me to bring this to Microsoft. The post on MSDN was answered that this bug had been fixed in VS2008 SP1 with InterlockedIncrement() instead of using a critical section. We also added this into Intel Knowledge Base.
This case made me think again how dependent we are on a particular implementation of something you thought to be a standard. So, even if your code compiles across platforms just fine, execution behavior can be way too different. How does gcc implement streams ? Or Solaris compiler ? God knows. I also recalled STLPort as a cross-platform implementation but how certain can you now be it’s *really* cross-platform...
Oh dear! They did not fix it. Just installed VS2008SP1 and found the bug is still there:
ReplyDelete_CRTIMP2_PURE void __CLR_OR_THIS_CALL _Incref()
{ // safely increment the reference count
_BEGIN_LOCK(_LOCK_LOCALE)
if (_Refs < (size_t)(-1))
++_Refs;
_END_LOCK()
}
_CRTIMP2_PURE facet *__CLR_OR_THIS_CALL _Decref()
{ // safely decrement the reference count, return this when dead
_BEGIN_LOCK(_LOCK_LOCALE)
if (0 < _Refs && _Refs < (size_t)(-1))
--_Refs;
return (_Refs == 0 ? this : 0);
_END_LOCK()
}
The moderator on the MSDN forum confirmed they had missed it :-(.
Even worse. They won't fix it even in Dev 10 (and presumably SP's) - see http://connect.microsoft.com/VisualStudio/feedback/ViewFeedback.aspx?FeedbackID=492561
ReplyDelete