Extreme fight with Extrema. Part 1

Those who regularly follow my blog likely remember that after recent speed up of Boolean Operations (see Why are Boolean Operations so slooo...ooow?) there remained a test case from Pawel K that almost was not affected. So I accepted a challenge and spent more time on this. Many thanks to Pawel for providing these models. Though they appeared to be very peculiar (such as tiny ellipse arcs and huge B-Splines of 300+ poles and 40+ knots) they allowed to detect a problem with wider impact.

Bottom line first: achieved speed up is 2.3x (from 113s to 49s). Not bad, but not quite a breakthrough because elapsed time is still in order of dozens of seconds. There are some (not obvious) further directions for optimization but they will likely be even more time-consuming, so I am not yet ready to commit to that (moreover as I already spent half of my vacations ;-)).
But what is more important is that made modifications open a door to speed up other OCC algorithms, as modified was the core component – Extrema. Today's story is about changes in it…

So, as usual, I used Intel Parallel Amplifier (by the way, we just RTM'ed, Released to Manufacturing, the version for public beta which should be available in January) to understand hotspots.

They were totally different than in the cases I dealt with in my previous experiment and it promptly became obvious they related to edge-to-edge intersections (see stack on the right).

Looking at IntTools_BeanBeanIntersector::ComputeUsingExtrema() it became understandable what was happening.

Each pair of edges which is a candidate for intersection (e.g. if their bounding boxes intersect) is analyzed. Each edge is split into certain number of ranges (e.g. ellipse – into 40) and then these ranges are analyzed with the other edge's ranges. And though inside ComputeUsingExtrema() the 2nd range does not change, the Extrema object (used to calculate the smallest distance) is created from scratch and initialized with it every time. Diving further with the debugger into Extrema, I found out that B-Splines are split into C2 intervals (which on Pawel's model were about 20!). Along each curve interval, a few sample points (32 by default) are taken. Distances are measured on the fly and candidates are further passed to math_Function* for precise calculations. Inside it the number of sample points is sometimes magically increased by 2x (just for more reliable sampling, ah ?) So, tons and tons of calculations without any attempt to reuse what already been calculated some time before.

The main idea for optimization was obvious – cache and reuse. This would require some change in Extrema API (as it did not support that) which I eventually did. It is currently limited to 3D curves only, similar improvements for other cases are deferred until OCC folks (who I sent the fixes today to) can confirm there are no regressions.

But before proceeding to API redesign, I decided to make finer improvements to make sure tiny bottlenecks do not go away from radar due to larger-scale improvements.

(To be continued).


  1. I'm used with Kcachegrind, a decent performance profiler: http://kcachegrind.sourceforge.net/html/Home.html which goes pretty decent on somelike workloads. Excluding that is Linux only (a platform that OpenCascade it supports, and Intel too ;).

    My question is: how much overhead did bring OCC instrumentation?

  2. I have not tried Kcachegrind and it was not in our list at Intel to compare by far.
    One of competitive advantages of Amplifier is that it does NOT instrument the code (and so there are no any special preparation of binaries).
    In hotspot analysis mode overhead is <5-10%.