Open CASCADE and multi-threading again. B-Splines…
This is a small continuation of a series on parallelism and Open CASCADE started last year. This post is about one particular family of classes, but it is fundamental enough to deserve a separate mention.
Geom_BSplineCurve, Geom_BSplineSurface, Geom2d_BSplineCurve, Geom_BezierCurve, Geom_BezierSurface, Geom2d_BezierCurve. They implement B-Spline and Bezier curves and surfaces.
Supposedly to optimize the calculation of points and derivatives, they use a cache that stores information from the calculation of the last B-Spline segment. In 2008, when I was working on Extrema performance improvements, I ran some experiments on the efficiency of this cache. I tried several different workloads – Boolean operations, curve projections, IGES and STEP imports, and a few others – but cache misses were ~50%-75%. That is, in more than half of the cases, the cache had to be flushed and recalculated. This made me wonder whether the technique makes sense at all, given that such computation-intensive algorithms do not really take advantage of it. Of course, to claim that I would have had to conduct more thorough experiments, which I did not have time for. So I just left the issue alone. (But if anyone has some insights, I would be interested to hear them.)
The issue popped up again a year later (December 2009 or early January 2010), when I was preparing a public release of the ACIS converter of CAD Exchanger. By then it had been parallelized and took advantage of multi-core machines, performing translation in multi-threaded mode. However, on some models I was occasionally getting spontaneous crashes, which made me look for the root cause. As the symptoms looked like data races, I launched Intel Parallel Inspector to see if that was true.
What was my surprise when I saw that the root cause was ... Geom_BSplineCurve::D0()! My first impression was that Inspector was probably too zealous about OCC and plain wrong, as D0() is just the calculation of a point on the curve. Moreover, it's a const method. How can there be a data race between two const methods? But after looking deeper at the stacks and source code, and applying some brain power, I realized it. Yep! D0() first casts the const pointer to a non-const pointer and calls a non-const method:
Geom_BSplineCurve * MyCurve = (Geom_BSplineCurve *) this ;
MyCurve->ValidateCache(NewU) ;
So, what does that mean? It means that when the same B-Spline curve is used in 2 or more threads, they read/write the cache without synchronization, overwriting each other's data and creating a data race. Obviously, the value returned by D0() can be totally wrong, which leads to subsequent crashes in the algorithm that calls it (e.g. projection of a 3D curve onto a B-Spline surface). As this situation (simultaneous use of the same curve or surface object in different threads) is rather rare, the crashes did not happen consistently.
I did not have time to fully solve the problem before that release, so I had to simply disable parallelism in the ACIS converter. Now I'm returning to this issue, and I'm going to introduce object copying before use in multiple threads (which will add some overhead, but this will be a 'cold path', as such cases will still be rare).
So if you consider using Open CASCADE in multiple threads, beware of this potential issue. Make sure you have protected access (e.g. object copying, access synchronization, etc.) or independent copies of the B-Splines.
4 comments
Wouldn't it be easier to use TLS (Thread Local Storage) instead? Almost all compilers support it today, and you wouldn't need to change anything in the algorithms.
Micke, thanks for the comment. These B-Splines (like any other results of translation) are objects shared among threads throughout the entire translation time, not specific to any thread. Therefore TLS has nothing to do with it.
Hello Roman,
I believe the right approach would be to provide a separate evaluator class performing optimized (cached) calculation of B-Splines, to be used locally in places where it is important for performance. For the general case of near-to-random evaluation of B-Splines, creating the relevant data on demand (inside the Value() functions) can be sufficient.
Andrey
Hi Andrey,
Thanks for the comment. Well, even today, I believe, if calculations exploit locality of the points (i.e. are conducted in proximity to each other), the current approach is fine ... from a single-thread perspective. If calculations are random or distant (e.g. spanning wide ranges of the surface/curve), the cache miss ratio is high and the cache does not bring much value. For threaded applications this makes it almost unusable 'as is'. What could also be helpful is a user-defined parameter controlling use of the cache. By default, it could be enabled (to preserve current behavior), but the developer could opt out. In concurrent applications this may be even more relevant, as total throughput could be higher thanks to avoiding copies and/or synchronization, even when individual calculations are performed without caching.