Why are Boolean Operations so slooo...ooow? Part 3
(continued)
For my experiments I used my own patched version of Open CASCADE 6.3.0 that accumulates all the modifications for prototyping multi-threading I mentioned earlier on the forum. One set of those modifications is thread-safe versions of BSplSLib, BSplCLib and PLib. The original 6.3.0 uses static arrays which are re-used by successive calls to ::D0(), ::D1(), etc., which is obviously not thread-safe (or even reentrant), so a few months back I changed them to use local buffers allocated and destroyed on every relevant *Lib function call. This was just fine for IGES import scenarios, but working on BOPs I noticed that my code showed performance regressions compared to 6.3.0. See the image below:
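In code, the difference between the two approaches looks roughly like this (a sketch with made-up names, not the actual *Lib code):

```cpp
// Sketch of the two buffer strategies; names are illustrative, this is not
// the actual BSplSLib/BSplCLib code.

// Original 6.3.0 pattern: one shared static buffer re-used between calls.
// Fast, but two threads evaluating curves at the same time overwrite each
// other's data -- neither thread-safe nor reentrant.
static double* theSharedBuffer = 0;
static int     theSharedSize   = 0;

void Evaluate_SharedStatic (int theNeededSize)
{
  if (theNeededSize > theSharedSize)
  {
    delete[] theSharedBuffer;
    theSharedBuffer = new double[theNeededSize];
    theSharedSize   = theNeededSize;
  }
  // ... evaluation writes into theSharedBuffer ...
}

// First fix: a buffer local to every call. Thread-safe, but with millions
// of calls the constant allocation/destruction is what causes the
// regression shown on the chart.
void Evaluate_LocalPerCall (int theNeededSize)
{
  double* aBuffer = new double[theNeededSize];
  // ... evaluation writes into aBuffer ...
  delete[] aBuffer;
}
```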
So it was obvious that constant allocations/destructions were not an option. I checked how many times re-allocation (or actually allocation) happened on a BOP test case, and it was 11.7 million times; the allocated buffers were from 2 to 264 doubles long. Well, I had to return to the previous approach (i.e. re-allocating only when the newly requested buffer exceeded the previously allocated one). But how to ensure reentrancy? The answer was quite obvious – TLS, or Thread-Local Storage. That is, each thread has its own buffer, and that buffer is only used by that thread. So I wrote a class (let's tentatively name it TLSManager) that contains a map (hash table) of {Standard_ThreadId, set of buffers} and returns a requested buffer depending on the id of the thread requesting it.
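Below is a minimal sketch of that idea; the names (BufferSet, Reserve, Buffers) and the exact layout are illustrative, not the actual class:

```cpp
#include <map>
#include <thread>

// One set of buffers owned by a single thread. Reserve() re-allocates only
// when the requested size exceeds what is already allocated, so almost all
// of the ~11.7 million requests reduce to returning an existing pointer.
struct BufferSet
{
  double* myData;
  int     mySize;
  BufferSet() : myData (0), mySize (0) {}
  double* Reserve (int theSize)
  {
    if (theSize > mySize)
    {
      delete[] myData;
      myData = new double[theSize];
      mySize = theSize;
    }
    return myData;
  }
  // Cleanup on thread exit is omitted for brevity.
};

// Maps a thread id to the buffer set owned by that thread
// (std::this_thread::get_id() stands in for Standard_ThreadId here).
// A BufferSet is only ever touched by its owning thread, so the buffers
// themselves need no locking; only the map lookup/insertion must be
// protected, which is where the read-write lock discussed next comes in.
class TLSManager
{
public:
  BufferSet& Buffers()
  {
    return myMap[std::this_thread::get_id()];
  }
private:
  std::map<std::thread::id, BufferSet> myMap;
};
```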
Another obvious problem popped up – TLSManager must be thread-safe, but using Standard_Mutex to protect it would be overkill. There is a standard solution to this problem: a read-write lock, i.e. an object that allows multiple concurrent read accesses to a shared resource (in our case the map with buffer sets) and exclusive write access (when a new set of buffers is created for a new thread). Well, I went and re-invented the wheel, adding a class OSD_ReadWriteLock. Doing this (as when adding other thread classes in OSD a while ago), I once again thought that OCC should rather bring in Boost (www.boost.org) than keep designing its own wheels. Salome already uses Boost, so OCC obviously could too. For instance, Boost also offers a template for TLS - http://www.boost.org/doc/libs/1_37_0/doc/html/thread/thread_local_storage.html.
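For comparison, the Boost facility mentioned above makes the per-thread lookup almost trivial; a sketch, reusing the illustrative BufferSet from the previous snippet:

```cpp
#include <boost/thread/tss.hpp>

// boost::thread_specific_ptr keeps one BufferSet per thread and deletes it
// automatically when that thread exits, so neither the hand-written map
// nor the read-write lock would be needed.
static boost::thread_specific_ptr<BufferSet> theThreadBuffers;

BufferSet& ThreadBuffers()
{
  if (theThreadBuffers.get() == 0)
    theThreadBuffers.reset (new BufferSet); // first call from this thread
  return *theThreadBuffers;
}
```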
Once the RWL class was written, I created a simple test case to check it and used Intel Parallel Inspector for that. Inspector identifies memory errors (leaks, use of uninitialized memory, etc.) and threading errors (data races, deadlocks, etc.). The thread checker is one of a kind (there is no other such software as far as I know), but its overhead is substantial.
Well, that session with Inspector was something! It reported data races, as if my RWL simultaneously read from and wrote into the same memory (an object member).
I spent several hours feeling like a complete dummy, staring at my one page of code, trying to root-cause the errors and beating my head against the keyboard and anything else around. A crazy Friday evening at home! When I gave up, I recompiled my unit test to use Qt's QReadWriteLock, and what? The same data races!
That made me doubt even further; I wrote down all the behavior scenarios on a sheet of paper and confirmed that there could not be any data races, everything was protected. It was just a plain crazy false positive (i.e. a report about a problem that does not actually exist)! I knew that Inspector had false-positive issues, but I could not imagine it would beat me that badly. So I am looking forward to talking to my Intel colleagues about it. (Make a note: if you download Inspector and notice the same problem, you might want to recall my case – perhaps it will be the same unfixed false positive ;-)).
OK, now confident in my RWL, I went and tried TLSManager with the RWL in BSplSLib. And? The regression was gone! The small overhead of the RWL (compared to the former shared static buffers) turned out to be unnoticeable. Excellent!
With all these modifications, the overall speed-up vs. 6.3.0 was about 4x on the Open CASCADE test case. I tried a simpler model sent by forum participants, and it showed a 20x speed-up. Very small cases running in a fraction of a second did not show substantial gains. Thus, in general, we can roughly project speed-ups in the 3x-10x range on average.
There is still something left to do, e.g. designing TLSManager in CDL and migrating PLib and BSplCLib to it. I will do this as time permits, hopefully soon while my memory is fresh.
And, by the way, if you want to try out your models with the new version, please feel free to send them to me via email or a download link. Those of you eager to get the fixes, just let me know ;-)
Looking back, I think the time spent on this was worth it. I do hope that these findings will inspire the OCC team to dig further and find more room for improvement, beyond BOPs. I hope that my colleagues at Intel will appreciate the 14 bug reports and enhancement requests I compiled during these days, and that by the commercial release the tools will be even better than they are today. I was able to learn something new in the depths of the OCC modeling algorithms, and that was good. Folks from the community will benefit via future OCC releases that will hopefully include my modifications.
I will continue to prepare OCC test cases for app testing, and if there is anything interesting, I will share it with you.
Let me add a few more words on performance. Being at Intel, I now view it a bit differently than when I worked at a software development company. Guys, times are changing (or have already changed, if you like). The free lunch, when your app would run faster with every newly released processor, is over. The megahertz era is over. To make your application run faster, you must make it multi-threaded and scalable. Performance is not just about your app running faster. Higher performance means more features. Look at the spell checker in MS Word – it checks as you type your document. That is only possible because it is fast enough and because it runs in a parallel thread.
If you want to stay competitive in the market, you must parallelize. There is no other way. It is challenging, but fortunately there are tools to help with that, and I am happy to be connected with them in some way. Go and try the Intel tools (www.intel.com/go/parallel).
Endorsement? Well, maybe. Sincere? Absolutely! (I practice what I preach ;-) )
Good luck!
(end)
P.S. Please rate this article using the voting buttons below.
18 comments
Pretty nice!
I'm eager to get the fixes (now you know :))
Hello Roman,
Very glad to know that your fight with OCC gives you some advancement in your work, besides the satisfaction of defeating a hydra. :)
Your article and results are quite impressive. I just have a couple of comments:
- As I guess, the arrays in BSplCLib mentioned above (part 3) are most likely used for the polynomial expansion of a spline segment for its evaluation. In this case the buffer could be allocated as an automatic local array, since the maximal degree of a b-spline, and hence the maximal number of polynomial coefficients, is limited in OCC. In my opinion, using TLS looks like overkill here...
- For the problem of dynamic arrays (mentioned in part 2), did you consider using NCollection_Vector? Why create a new class (IntPolyh_DynamicArray) if it would be possible to use (and, perhaps, improve) an existing one?
Andrey
Hi Roman,
You really did an excellent job. My congratulations.
One note I have: this work was not connected with parallelizing the code (at least as concerns the bottlenecks in BOP), so it could have been done using other profiling tools, like IBM Rational Quantify. Could you describe the benefits of the Intel tools compared to other profilers when working in a single-threaded environment?
Michael
Great and interesting articles....!
I'm downloading the Composer beta version, so I'm eager to get the fixes too ;)
2 all:
Thanks to all for the feedback and voting. Please don't hesitate to vote - I'm really interested to see how many people read the blog and have opinions on its contents.
2 Andrey:
Great comments (as usual)! Keep them coming. We talked with Andrey offline, and I will try to check his ideas. I'll be happy to withdraw the new classes and reuse existing ones to keep the code leaner.
2 Michael:
Parallel Amplifier has great Concurrency and Waits & Locks analysis features (which are indeed vital when making an app multi-threaded; they were instrumental when I was parallelizing IGES import). Concerning Hotspot analysis and profiling in general, I would name result comparison (you can compare results obtained on different runs, e.g. on different versions of your code) and the lack of instrumentation. The latter means that your code does not need to be pre-processed (instrumented) before profiling. I must confess that I have not used Quantify for 9 years, so I wouldn't claim these are unique features. And yes, ease of use is also good. We put significant effort into making it easy to use (after our legacy products, which were quite complex). Anyway, as usual, this is all my personal opinion, not an official statement.
2 Ceniza and Pawel:
Folks, let me first try switching to NCollection_Vector and removing TLS (as suggested by Andrey) to see if it works. Anyway, the fixes are numerous and include changes in CDL, so you will have to make some effort to recompile OCC with them. This will take me a few days for sure (we are working hard now to finalize the bits before the public beta).
2 Pawel:
Composer should be available now or shortly. It includes the compiler and libraries (MKL, IPP, TBB, OpenMP). I don't know if anyone has ever tried to compile OCC with the Intel compiler (I think this should go on the OCC folks' TODO list). Regarding TBB, I also think it could be a mid-term option for OCC to consider.
Great articles, Roman! Thanks a lot! I will be following your blog from now on regularly.
I will definitely take a look at the Intel tools. We have problems with performance and memory management in our OCC-based application, and these problems are becoming critical. Your blog gives us hope ))
Folks, I've posted the fixes to IntPolyh on SourceForge here. They are well isolated and do not depend on/affect the other modifications (including BSplSLib, etc.). Though I looked at the option of migrating to NCollection_Vector as Andrey suggested, I did not do so, as it would likely lead to performance degradation (due to the _Vector memory allocation strategy). I have suggested to the OCC folks that they do it step by step and migrate later, if required.
The fix structure is straightforward – just unzip it into %CASROOT%. If you are using WOK, regenerate from the cdl files. If not, you will have to modify your configuration files (TKGeomAlgo.vcproj for MSVS, or the automake-related files) to include the ./drv/IntPolyh_*0.cxx files. Then rebuild TKGeomAlgo. Source and binary compatibility with other libraries is not broken.
If you have any problem, post a comment here.
Can you do me a favor? I need your feedback! In *any* case – improvement, regression, or no change. Send me your models and the BOP operation you apply. This will give me a broader test database.
Thanks and good luck!
Good news! The latest build of Inspector fixes the problem of false positives on the RWL mentioned in part 3, so the public beta should not expose this bug.
This comment has been removed by the author.
2 Solar Angel:
Nope, this is intentional: the new files overwrite the previous explicit IntPolyh_Array* classes, which are now instances of IntPolyh_DynamicArray. This removes a conflict during linking (duplication of symbols).
Another approach is to remove the previous IntPolyh_Array*.cxx files from your make files (*.vcproj, etc.).
This comment has been removed by the author.
Hi Roman, thank you. However, I cannot integrate your fix into the library. I'm using VC++ 6.0 and I get many compile errors. I did everything you described. It seems that some header files are missing. You uploaded only two files, IntPolyh_DynamicArray.gxx and IntPolyh_DynamicArray.lxx. Are they enough?
Yeah, thanks! A few header files were indeed missing. They are now added and re-posted as a new complete archive, fix203.1.zip. Those who downloaded the previous file should download and re-install the new one. Sorry, guys. My foul-up.
Hi Roman,
For a beginner (I suppose all the others using your fixes already know this), can you please explain how we can get the cxx and hxx files from the gxx, lxx and cdl files?
Thank you!
Hi,
The fixes are already in a form ready for compilation. The CDLs have been extracted with WOK to produce the .hxx and drv/*_0.cxx files.
Just copy them into your OCC installation, overwriting some existing files. What you will need to do is add the drv/*_0.cxx files into the VS projects (e.g. for VS2005) you use to build OCC.
Hi Roman,
Thank you for the quick answer! It helped. Now I'm looking forward to testing my cases with Boolean operations.
Best regards.
Hello Roman,
I use Open CASCADE 6.3.0 and MS Visual C++ Express Edition. I too have a problem with very slow Boolean operations, so I found your blog and read it all.
I took your files from sf.net:
OCC631-fix100-355-win32-vc7.zip
OCC631-fix100-355-src.zip
OCC630-fix200-299.zip
OCC630-fix100-199.zip
First I tried to just copy the file "TKGeomAlgo.dll" from "OCC631-fix100-355-win32-vc7.zip" into my OCC. But I got many link errors in my application. I think it is because of the different OCC versions.
Then I tried to copy the source files "drv/IntPolyh/*" and "src/IntPolyh/*" from "OCC631-fix100-355-win32-vc7.zip"; I replaced the existing files and added the new files to the project to build a new "TKGeomAlgo.dll". But I got many link errors when trying to build "TKGeomAlgo.dll". So maybe this is also because of the different OCC versions (mine is 6.3.0, the archive is for OCC 6.3.1).
So then I looked at "OCC630-fix200-299.zip".
There I don't find the drv/*_0.cxx files. So please tell me how I can generate these files with WOK (I don't know how this works).
Denis.