2009-12-21

Surface modeling. Part4

(continued...)
Constructing arbitrary sweep surfaces
GeomFill algorithms considered in the previous posts internally use Approx_SweepFunction subclasses that define particular techniques to calculate cross-sections.
You can create your own subclass to model a particular sweep surface. Approx_SweepFunction has a few pure virtual methods which your subclass must redefine. For instance, in ACIS-SAT import for CAD Exchanger I use it to generate variable-radius surfaces. In ACIS, such surfaces are defined with the help of so called support surfaces (left and right), path curve, left and right radius functions, cross-section forms (circular, elliptical, chamfer, etc). Below is an example of such surface (in dark yellow) with elliptical section; support surfaces are in green and bright yellow, and path is red.



Here is an exceprt of the method that calculates a cross section at parameter along the path:

Standard_Boolean ACISAlgo_VarBlendSweepFunction::D0 (const Standard_Real theParam,
const Standard_Real,
const Standard_Real,
TColgp_Array1OfPnt& thePoles,
TColgp_Array1OfPnt2d&,
TColStd_Array1OfReal& theWeights)
{
gp_Pnt2d aLRad, aRRad;
// left radius
aLRad = myLRad->Value(theParam);
// right radius
if (myRRad == myLRad) {
aRRad = aLRad;
} else {
aRRad = myRRad->Value(R2Param);
}

gp_Pnt aPathPnt;
gp_Vec aNormal;
myPath->D1 (theParam, aPathPnt, aNormal);

const gp_Ax2 Axis (aPathPnt, aNormal);
const gp_Pln aSecPln (Axis);
Handle(Geom_Plane) aGSecPln = new Geom_Plane(Axis);

//intersection of a section plan with support surfaces
Handle(TColGeom2d_HArray1OfCurve) aLCArr, aRCArr;
if (!::Intersect (aGSecPln, aSecPln, myLSurf, aLCArr) || !::Intersect (aGSecPln, aSecPln, myRSurf, aRCArr))
return Standard_False;

//compute a section in 2D and restore it into 3D
const Standard_Integer n = thePoles.Upper();
TColgp_Array1OfPnt2d aPArr (1, n);
if (!mySec->ComputeSection (Axis, aLCArr, aLRad, aRCArr, aRRad, aPArr, theWeights))
return Standard_False;

for (Standard_Integer i = 1; i <= n; i++) {
thePoles.SetValue (i, ElCLib::To3d (Axis, aPArr (i)));
}
return Standard_True;
}

The above redefined virtual method D0() populates poles and weights arrays of the B-Spline curve which defines a cross-section.

Here is how to construct your surface using Approx_SweepFunction subclass:
...
Handle(ACISAlgo_VarBlendSweepFunction) aSweep = new ACISAlgo_VarBlendSweepFunction (...);

Approx_SweepApproximation anApprox (aSweep);
const Standard_Real aTol = 1e-4;
anApprox.Perform (aHPath->FirstParameter(), aHPath->LastParameter(), aTol, aTol,
Precision::Parametric (aTol), 1-3,
(GeomAbs_Shape)Min (GeomAbs_C1, aHPath->Continuity()), BSplCLib::MaxDegree(), aMaxSeg);
Handle(Geom_Surface) aRes;
if (anApprox.IsDone()) {
aRes = new Geom_BSplineSurface(anApprox.SurfPoles(),
anApprox.SurfWeights(), anApprox.SurfUKnots(), anApprox.SurfVKnots(),
anApprox.SurfUMults(), anApprox.SurfVMults(),
anApprox.UDegree(), anApprox.VDegree());

I guess using the above technique you can create arbitrary sweep surfaces, e.g. like the one below:




Topological algorithms
As mentioned in the first post of this series, as a rule Open CASCADE offers algorithms both at geometrical and topological level. Until now I was mainly focused on the geometrical level (just because I myself used it :-)). Let me try to give a brief mapping with a topo one:

1. Sweep of 1 curve along the other (including pipe with constant section)
a. GeomFill_Pipe (path, profile)
b. BRepOffsetAPI_Pipe (path, profile)
2. Pipe with constant radius (tube-like):
a. GeomFill_Pipe (path, radius):
b. BRepOffsetAPI_Pipe (path, profile) where profile must be pre-created
3. Pipe with constant radius with rail curves
a. GeomFill_Pipe (path, radius, rail1, rail2)
b. Unsupported ?
4. Pipe with variable radius
a. GeomFill_Sweep (path, radius_function)
b. Unsupported ?

Note that topo algorithms can use elements of diffent dimensions for profile object. This defines a type of a result. For instance, a pipe of vertex will generate an edge, edge – face, wire – shell, face-solid, etc.

Brief note on BRepOffsetAPI_MakePipeShell. As it works with wire spines, it must be able to deal with discontinuties between the edges (when edges intersect with sharp edges). It offers 3 modes:
- intersection and rounding
- extension and intersection
- modifying trihedron mode used when sweeping a profile along the following edge in a spine.
The following picture shows these modes respectively in red, green and blue:



To be continued...

2009-11-30

Surface modeling. Part3

(continued...)

Pipe with constant radius
A particular case of a pipe with a constant section is a pipe with a constant radius. In ACIS such surfaces are called tubes.

Here is a sample screenshot:



Here is an example of creation:

GeomFill_Pipe aTube (thePath, theRadius);
aTube.Perform (aTol, Standard_False, (GeomAbs_Shape)Min (GeomAbs_C1, thePath->Continuity()), aMaxDeg, aMaxSeg);


Pipe with constant radius and two rail curves
As a convenience, the pipe algorithm allows to specify two rail curves. Rails are those which limit the cross section.
This algorithm flavor can be used to model so-called rolling ball surfaces. ACIS defines rolling-ball surfaces as if you have a ball of a constant radius that rolls along the path and always touches two limiting faces. The traces that the ball makes on those boundary faces are called spring or rail curves.

Open CASCADE algorithm accepts the radius, the path and two rail curves and creates a surface with limited circular sections. Each section is constructed by intersecting a plane at every path point and perpendicular to it with rail curves.

Here are sample screenshots of rolling ball surfaces:




On both images the pipes are shown in red and rail curves in blue. The 2nd screenshot also contains a full tube of which a fragment is constructed using 2 rail curves. This tube is a trace that a full rolling ball would make.

Important note to make is that Open CASCADE requires that rail curves follow the path parametrization. This means that the rail curves' ranges must be at least as big as the path's and be consistent between each other.


Pipe with variable radius
In addition to constant radius, you might want to create tubes (i.e. pipes with circular sections) with variable radii. For instance, like this:




Open CASCADE does not offer direct API to construct such surfaces but you can do these using lower level API it offers. For instance, here is my code excerpt:


/*! Set radius evolution function with SetEvol() before calling this method.

If \a theIsPolynomial is true tries to create polynomial B-Spline, otherwise - rational.

\sa Surface(), Error().
*/
void ACISGGeom_Pipe::Perform (const Standard_Real theTol,
const Standard_Boolean theIsPolynomial,
const GeomAbs_Shape theContinuity,
const Standard_Integer theMaxDegree,
const Standard_Integer theMaxSegment)
{
mySurface.Nullify();
myError = -1.;

if (myEvol.IsNull())
return;

//circular profile
Handle(Geom_Circle) aCirc = new Geom_Circle (gp::XOY(), 1.);
aCirc->Rotate (gp::OZ(), PI / 2.);

//code inspired by GeomFile_Pipe when using for constant radius and corrected
//trihedron orientation

//perpendicular section
Handle(GeomFill_SectionLaw) aSec = new GeomFill_EvolvedSection (aCirc, myEvol);
Handle(GeomFill_LocationLaw) aLoc = new GeomFill_CurveAndTrihedron (
new GeomFill_CorrectedFrenet);
aLoc->SetCurve (myPath);

GeomFill_Sweep Sweep (aLoc, myIsElem);
Sweep.SetTolerance (theTol);
Sweep.Build (aSec, GeomFill_Location, theContinuity, theMaxDegree, theMaxSegment);
if (Sweep.IsDone()) {
mySurface = Sweep.Surface();
myError = Sweep.ErrorOnSurface();
}
}

In this case myEval is Handle_Law_BSpFunc object constructed from 2D B-Spline which defines radius evolution:

/*! Creates an internal Law_BSpFunc object which represents an evolution function. Uses X
coordinates of the \a theEvol B-Spline curve.

\a theFirst and \a theLast are boundaries of the path curve.
*/
static Handle(Law_BSpFunc) CreateBsFunction (const Handle(Geom2d_BSplineCurve)& theEvol,
const Standard_Real theFirst,
const Standard_Real theLast)
{
//knots are recalculated from theEvol prorate to [theFirst, theLast] range
Standard_Integer i;
const Standard_Integer aNbP = theEvol->NbPoles();
TColgp_Array1OfPnt2d aPArrE (1, aNbP);
theEvol->Poles (aPArrE);
TColStd_Array1OfReal aPArr (1, aNbP);
for (i = 1; i <= aNbP; i++)
aPArr(i) = aPArrE(i).X();

const Standard_Integer aNbK = theEvol->NbKnots();
TColStd_Array1OfReal aKArrE (1, aNbK), aKArr (1, aNbK);
theEvol->Knots (aKArrE);
TColStd_Array1OfInteger aMArr (1, aNbK);
theEvol->Multiplicities (aMArr);

const Standard_Real aKF = aKArrE(1), aKL = aKArrE (aNbK);
const Standard_Real aKRatio = (theLast - theFirst) / (aKL - aKF);
for (i = 1; i <= aNbK; i++) {
aKArr(i) = theFirst + (aKArrE(i) - aKF) * aKRatio;
}

Handle(Law_BSpline) aBs;
if (theEvol->IsRational()) {
TColStd_Array1OfReal aWArrE (1, aNbP);
theEvol->Weights (aWArrE);
aBs = new Law_BSpline (aPArr, aWArrE, aKArr, aMArr, theEvol->Degree(),
theEvol->IsPeriodic());
} else {
aBs = new Law_BSpline (aPArr, aKArr, aMArr, theEvol->Degree(), theEvol->IsPeriodic());
}

Handle(Law_BSpFunc) aFunc = new Law_BSpFunc (aBs, theFirst, theLast);
return aFunc;
}

/*! Uses X coordinates of the \a theEvol B-Spline curve to set evolution function.
*/
void ACISGGeom_Pipe::SetEvol (const Handle(Geom2d_BSplineCurve)& theEvol)
{
myEvol = ::CreateBsFunction (theEvol, myPath->FirstParameter(), myPath->LastParameter());
}


Below are examples of radius function (as 2D B-Spline) and resulting surface:





Common comments
Pipe surfaces are parametrized in U along the cross section and in V along the path. Surface parametrization inherits the path range and adjusts U to a section range. For instance, tubes are parametrized from 0 to 2*PI in U.

To be continued...

Invitational Beta for ACIS-SAT support just started

ACIS-SAT format support now available in CAD Exchanger and for any other Open CASCADE-based application

CAD Exchanger has been extended with support of ACIS-SAT import and export. After several months of hard work and thorough testing this new format is now available to our users.
CAD Exchanger supports ACIS from version R1.5 to the latest R20 released in 2009. The scope includes a full set of geometrical and topological objects.

For developers on Open CASCADE, the ACIS-SAT plugin can be delivered as SDK that can be directly integrated with Open CASCADE-based application using either TopoDS_Shape interface or a BRep file.

We have conducted thorough testing through the database of 1000+ models before this announcement, so you should not be annoyed by any severe bug. But there can be some corner cases which could potentially reveal some issues. So please let us know if you find anything. But if you just like the product and it works fine in our case please tell us this as well. We need your feedback !

To get immediate download access please email us at info@cadexchanger.com.

Public Beta will be made available early next year after addressing feedback from our most active users. Sign up now to get your voice heard !

2009-10-26

Surface modeling. Part2

Sweep surfaces
Sweep surfaces are constructed using a spine and a profile that moves along it.
Here is a screenshots of a typical sweep surfaces:



The sweep surface is constructed using GeomFill_Pipe. Perhaps the name pipe stems from the fact that a particular case when the profile is closed produces a pipe-like surface.

GeomFill_Pipe Pipe;
GeomFill_Pipe aPipe (aPath, aProfile, GeomFill_IsFixed);
aPipe.GenerateParticularCase(Standard_True);
aPipe.Perform(aTol, Standard_False, GeomAbs_C1, BSplCLib::MaxDegree(), 1000);
const Handle(Geom_Surface)& aSurface = aPipe.Surface();

This code is an excerpt from the CAD Exchanger, the translation driver for ACIS sum_spl_sur, which is defined as a sum of two curves.

By default, the sweep surface is created as a B-Spline, either rational or polynomial – depending on the parameter in the Perform() method. If you want to generate elementary surface (torus, cylinder, sphere, etc) when curves configuration allows, then call GenerateParticularCase() with Standard_True.

The algorithm can also return an approximation error – use ErrorOnSurf() to get it.

Sweeping is constructed dragging a profile along the spine and modifying its orientation along the latter. This behavior is controlled by a parameter of the type GeomFill_Trihedron. The following images illustrate how resulting surface is different for the same spine and profile (semi-circles):


GeomFill_IsFixed


GeomFill_IsFrenet


GeomFill_IsConstantNormal

You can experiment in DRAW using the 'sweep' command and providing various options.

Pipes
GeomFill_Pipe offers a few pre-defined construction techniques to construct a sweep surface:
- a pipe with constant section;
- a circular section pipe with constant radius;
- a circular section pipe with constant radius with two rails.

Pipes with constant section has been considered above. Here are two more examples of such pipes:


To be continued...

2009-10-10

Surface modeling. Part1

Surface modeling is a fundamental feature of any 3D geometric modeler. As you know, Open CASCADE offers a set of elementary surfaces (planar, conical, spherical, etc), Bezier and B-Spline, revolved, extruded and offset surfaces. There is also a trimmed surface which trims an underlying surface in parameter space.

Open CASCADE implements a subset of STEP (ISO standard 10303, part 42) for geometrical and topological entities, though with some minor deviations.

The surface object only contains a final geometrical representation and does not provide any information on how it was created. This differentiates it, for instance, from ACIS which uses a notion of so called procedural surfaces which may contain both a construction technique and an optional final approximation. For instance, a skin surface comes with description of a set of section curves the surface was 'skinned' through and resulting NURBS approximation. This leads to increase in number of entity types to be supported by the modeler and likely complexity of modeling algorithms that use them. (This also makes me suffer developing extra classes to represent all this variety in the CAD Exchanger translator and to translate them into Open CASCADE. I can successfully re-import all SAT files exported from Open CASCADE but not all possible SAT types yet). By the way, if there are readers who are familiar with ACIS their comments would be valuable to check how OCC capabilities compare with those of ACIS.

Open CASCADE favors different approach where knowledge of the performed modeling algorithms is stored elsewhere (e.g. in OCAF using function drivers). The B-Rep model only contains a result of an operation and thus is more compact.

Another point to make is that OCC offers algorithms both at geometry level (dealing with Geom_Surface and/or Geom_Curve objects) and topology level (TopoDS_Shape subclasses). The latter may use the former but this dual API is not always provided. Some algorithms are only available at the geometry level and some are only at topology. (If you are not certain about this distinction please make sure you re-read a series of posts earlier this year).

Let's see what modeling techniques you can use with Open CASCADE. I bet there is no point in going through description of construction techniques of elementary surfaces. Documentation and header files are just enough for that. Let's rather check what kind of advanced stuff OCC has.

Ruled surfaces
Ruled surface is constructed by connecting two curves (i.e. points along them) with lines. Particular case of a ruled surface is plane (as it can be built on two parallel lines). If you take two parallel circles then a ruled surface will be a cylinder or a cone.

Here is how it may look in general case:



Last spring, when we were in Spain and visiting Sagrada Familia in Barcelona (the central city cathedral, which is an on-going construction for several decades), there was an exhibition of Antoni Gaudi's construction techniques. Gaudi was an architect and he reused many techniques from nature. Among those there was a use of ruled surfaces – see some photos here or even read a paper 'Gaudi and CAD'.


You can create a ruled surface at geometry level as follows:
Handle(Geom_Curve) aCrv1 = ...;
Handle(Geom_Curve) aCrv2 = ...;
Handle(Geom_Surface) aSurf = GeomFill::Surface (aCrv1, aCrv2);

If you dealing at topology level you can create either a face using two edges or a shell using two wires. You need to use BRepFill:
TopoDS_Edge anEdge1 = ...;
TopoDS_Edge anEdge2 = ...;
TopoDS_Face aFace = BRepFill::Face (anEdge1, anEdge2);

TopoDS_Wire aWire1 = ...;
TopoDS_Wire aWire2 = ...;
TopoDS_Face aShell = BRepFill::Shell (aWire1, aWire2);

Here is a shell of two faces lying on ruled surfaces:



When using BRepFill::Shell(), wires must contain the same number of edges. If not you may need to re-approximate the edges. For instance you can reuse ShapeAlgo_Container::HomoWires() or create some similar algorithm, or re-approximate a wire using BRepAdaptor_CompCurve adaptor and Approx_Curve3d. The latter will produce a single B-Spline curve from a wire, which you can later use with GeomFill or create a TopoDS_Edge from of it to use BRepFill.

To be continued...

2009-10-09

Back again

Hi folks!

It took me a few weeks to get back to writing a new post, though I keep on reading most forum posts.

First, I wanted the latest post about seminar on parallel programming and Open CASCADE to stay a bit longer to let more people notice it. There were several feedbacks from the readers, all extremely positive and supportive. Many thanks again to those who replied, that was important. However, the number was still too small to justify the event so we will have to defer it for some time. There will be free webinars by Intel so those who are interested can participate online.

Many other things have been happening during this time. First, I have changed a position in Intel. My new job is engineering management of a local team that develops Threading Building Blocks and Open MP, two run-time libraries that are aimed at facilitation of creation of multi-threaded applications. Both are part of Parallel Composer, and the former is also available in Open Source. If you had been following my recent posts, I was describing my experience with TBB, which I became a fan of.

The team is quite mature and involves senior professionals, and the technical stuff is still quite of a challenge for me, so I continue ramping up. It turned out that my earlier experiments with TBB appeared to be a good upfront investment that now starts paying off. In general, throughout my career I observe that in order to successfully manage engineers you need be a good engineer ;-). You need to speak one language with people you work with.

Second, developing the ACIS-SAT translator for CAD Exchanger has also been time-consuming. The progress has been good so far and the exporter is feature complete, while the importer is mostly complete.

There were several interesting notes while working on the ACIS translator. I have adopted the Qt test framework for unit and integration testing, and was again fascinated how effective early testing can be. Indeed, the cost of a fix for a bug found during unit testing is just a small fraction of one found by your customer a few months later. I am trying to develop a discipline of having a test case for any new functionality, or even to have a test before the code is developed. An overhead that pays off in the future.

I thought what could be interesting for you, my readers. Though there are some draft notes made during last months let me try to offer you something new. I thought to share with you some techniques on surface modeling which could be done with Open CASCADE. I had not been working with it for years and don't remember many details but touched this topic now when developing translation of ACIS surfaces. So I may need some time to gather and to structure information on this topic. Answers can be slow as I will be studying some of these issues with you ;-).

So, let me start a separate series on this...

2009-08-24

Seminar on parallel development and Open CASCADE

Folks,

Today I would like to come with a question to you.

As you probably noticed, in several last posts, I was uncovering multi-threading programming issues. This subject has really caught me and applying it in Open CASCADE-based software reveals huge potential. Talking to my readers, on this blog and outside, I conclude that the issue of parallel programming and performance (regardless of Open CASCADE) is important for many developers and their employers. So, I thought to come out to you with the following idea.

What would be your thoughts on organizing some sort of technical seminar/training on parallel development ? This could be combined with training/consulting on Open CASCADE if there is a need. Regarding the former part (multi-threading) I would work with my colleagues at Intel to get support. Intel is very serious and working hard to promote this knowledge among mainstream software developers, and likely this idea will be supported if there is sufficient audience. Regarding the latter (Open CASCADE) I could check with Open CASCADE team if this will be of interest of them.

The challenge is how to find a place in the world for us to meet. Obviously, it will depend on locations of those who will respond.

So could you please think of this idea and let me know if you might be interested in such an arrangement. Certainly there will be some expenses involved (traveling and lodging) but all together we'll try to figure out how to minimize them.

Sounds too crazy? Post your comments with any ideas here or just send me an email at roman.lygin@gmail.com.

Thanks a lot in advance ! Let's make it happen !!!

2009-08-07

const Handle & vs Handle

Quick post. The addressed issue may seem quite obvious for professional C++ developers yet it's often overlooked.

Case1. When you need to downcast a handle to superclass to one to subclass bets are you are following Open CASCADE conventions:

Handle(Geom_Curve) aCurve = ...;
...
Handle(Geom_Line) aLine = Handle(Geom_Line)::DownCast (aCurve);
gp_Ax1 anAx1 = aLine->Position();
...

Case2. When you have a function returning a const Handle& you likely often write:

const Handle(MyClass)& GetMyClassInstance();

Handle(MyClass) anInstance = GetMyClassInstance();
//though you could write const Handle(MyClass)& anInstance

Both are totally valid code. But have you ever thought about what happens inside ? Remember that a Handle is not just simple type (like pointer), there is certain overhead inside. This overhead is to maintain a reference counter (increment, decrement) any time you copy a handle. If you use debugger and follow step-by-step instructions you will go through Handle_Standard_Transient::BeginScope(), EndScope(), copy constructor, assignment operator, etc.

This overhead is quite negligible when you deal with a few (hundreds ?) objects. However it may become noticeable when you are making performance-critical computations or deal with dozens of hundreds of objects. For instance, I did notice this when translating huge ACIS-SAT files with CAD Exchanger. Surprisingly BeginScope() and EndScope() appeared among top 5 hotspots.

Bad news is that such issue may often be hidden by other types, not necessarily handles themselves. The most frequent case, which Open CASCADE itself is vastly contaminated with is TopoDS_Shape. As you remember, TopoDS_Shape contains a field myTShape of the TopoDS_TShape type (subclass of Handle_Standard_Transient). So whenever you use something like:

1). TopoDS_Edge anEdge = TopoDS::Edge (aShape);
//instead of const TopoDS_Edge& anEdge = TopoDS::Edge (aShape);

or

2). const TopoDS_Shape& GetShape();

TopoDS_Shape aShape = GetShape();

//instead of const TopoDS_Shape& aShape = GetShape();

or

3). have a class returning TopoDS_Shape instead of const TopoDS_Shape& while it could have (e.g. when returning its own field)

class MyClass

{
public:
...
TopoDS_Shape Child() const { return myChild; }

//instead of const TopoDS_Shape& Child();

private:
TopoDS_Shape myChild;
};

You always assume a penalty of copy constructors and reference counters. Beware and pay attention to that !

Here are some quick recommendations on how to avoid this overhead:
  • a). use const Handle()& (or any other type) as return type whenever possible;
  • b). use const& as local variables whenever possible;
  • c). substitute Handle(MyClass)::DownCast() with direct cast (but only if you are certain of the type!):
Handle(Standard_Transient) aTarget;
const Handle(MyClass)& aMyClass = * static_cast
(&aTarget);

We touched c) in a very first post here. I'm currently thinking to extend Handle class with such a method to cast to const Handle&. Thus, Handle could cast the same as two C++ operators dynamic_cast and static_cast.

Any thoughts on this ?

2009-07-30

Scalability issues in CAD data exchange

(This is a re-post from Intel Software Network)

Continuing to eat our own dog food (i.e. to use Intel tools) let me share some recent practical experience.

I was measuring the parallelized part of the ACIS-SAT converter of CAD Exchanger (my personal project) to investigate its scalability. (SAT is a file format of ACIS, a well known 3D modeling kernel) The way to do this was to execute the same test case on the same architecture (e.g. Core 2 Duo or Core i7) changing the number of threads. Obviously, the ideal curve is 1/n where n is a number of cores. Below is the data I collected on a few files:



It includes measurements made on Core based Xeon 5300 (Clovertown) and the Corei7-based Xeon 3500 (Nehalem). Both are 4 cores with HT (hyper-threading). As you can see, Core i7 outpaces Core (each dashed curve is below the solid one) but the nature of the curve is the same for any file – it scales well from 1 to 2 to 4 but then becomes almost a constant (except maybe the pink one which corresponds to rather small ACIS file).

So why is that ? Discussing this issue with my colleagues at Intel, the first guess was that the application consumes all the bus bandwidth trying to load data from memory. That is, though the cores are free to execute the application cannot be fed it with data. We used another experimental Intel tool (Intel Performance Tuning Utility, available for free download at http://whatif.intel.com) to check this. This did not prove to be true – the experiments showed that though there are a lot of objects in memory they are allocated quite locally and can fit into Level 2 cache and so there are no bottlenecks of working with RAM.

So what else then ? Another guess was HT and that the OS perhaps not optimally distribute the workload among the cores and that a work running on the same core (there are actually 4 physical cores) but in different threads start competing for data. So I took another machine with 2 processors (i.e. 8 physical cores) with disabled HT. The data slightly changed – running in 5 threads gave speed up over 4 cores but then again, the time remained flat when increasing number of threads from 5 to 8.

To dive deeper into what happens I launched Intel Thread Profiler which (along with VTune Performance Analyzer) is a predecessor of Parallel Amplifier but has a nice feature which is not yet available in Amplifier – timeline.

I only focused on the parallelized part (highlighted on the image below).



As we can see, 7 other threads already finished their work (dark green became light green) while one thread (in this case, thread 1) continued to work. So this was an obvious example of non-scalability – adding more threads won’t help as there was only one working !

I execute parallel processing using tbb::parallel_for which has parameters of grain size and a partitioner. They define how the work is split into chunks. To track how TBB does this I used profiling API that added some marks to the timeline indicating start and end of processing each solid and their indexes.



We can see now that this time thread 5 was given the largest solids with indexes 14 and 15. That is, default values for grainsize and portioning strategy resulted that one thread was given several complex solids in a row which caused longer time to complete. Given that processing each solid involves multiple complex algorithms, it is safe to split processing into one solid at a time (this outweighs overhead implied by TBB for splitting). Here is a new code:

//use simple partitioner and grain size of 1 to ensure individual processing
tbb::parallel_for(tbb::blocked_range(1,aNbEntities+1,1),
ApplyPostProcess(this, theMap),
tbb::simple_partitioner() /*auto_partitioner()*/);

This immediately resulted in better performance! While one thread was processing a complex solid others were given those which remained. Of course, even in this case the duration will always be defined by the longest time of one solid processing. Moreover, even with such simplified approach there are still rooms for improvements – e.g., the workload can be sorted in the order of decreasing complexity. This will ensure that the ‘heaviest’ solids will be distributed and processed first, while ‘lighter’ solids will be deferred. This will significantly reduce chances of imbalances. Of course, this recommendation assumes that sorting is substantially faster than processing itself.

So, I hope this post will make you think how your workload is really parallelized, how you can use Intel tools to see this and how you can manage and improve it.

2009-07-21

CAD Exchanger Beta Update 1 is available

The CAD Exchanger product team is pleased to announce availability of the Beta Update 1 release of a new product to speed up translation of 3D data across multiple CAD formats.

Beta Update 1 can be freely downloaded from http://www.cadexchanger.com/download.html.

It addresses Beta users' feedback and introduces a few new features:
- New display mode - triangulation. Displays underlying triangulation of surface models.
- Ability to change precision of the surface models triangulation. This parameter affects visualization quality and conversion of the surface models into mesh formats (such as STL, VRML, etc)
- VRML 2.0 (1997) export is now supported.
- BRep and STL exporters now do not export hidden models (i.e. unchecked in the Model Explorer).
- Automatic checking for updates. User will be automatically notified when new product releases become available.
- Skipping invisible (blank) entities when importing IGES files at user's discretion.

For a full list please refer to Release Notes.

CAD Exchanger currently supports import, view, and export of IGES, STEP, STL, VRML and BRep formats (more are coming).

We continue to look forward to users' feedback that will help us prioritize development activities. Please share your experience at users' forum at http://www.cadexchanger.com/forum.


Sincerely,
The CAD Exchanger Team

2009-07-06

Developing parallel applications with Open CASCADE. Part 4

( continued...)

Open CASCADE threads
Open CASCADE provides thread abstraction in the form of OSD_Thread class that encapsulates OS-specific threads. It accepts a variable of the OSD_ThreadFunction type which is essentially a pointer to a function defined as follows:

typedef Standard_Address (*OSD_ThreadFunction) (Standard_Address data);

Here is a simple example of using Open CASCADE threads:

static Standard_Mutex aMutex;

Standard_Address Test_ThreadFunction (Standard_Address /*theData*/)
{
Standard_Mutex::Sentry aSentry (aMutex);
std::cout << "Running in worker thread id:" << OSD_Thread::Current() << std::endl;
}

int main (int argc, char *argv[])
{
const int MAX_THREAD = 5;
OSD_Thread aThreadArr[MAX_THREAD];

std::cout << "Running in master thread id: " << OSD_Thread::Current() << std::endl;

OSD_ThreadFunction aFunc = Test_ThreadFunction;
int i;
for (i = 0; i < MAX_THREAD; i++) {
aThreadArr[i].SetFunction (aFunc);
}
for (i = 0; i < MAX_THREAD; i++) {
if (!aThreadArr[i].Run (NULL))
std::cerr << "Error: Cannot start thread " << i << std::endl;
}

for (i = 0; i < MAX_THREAD; i++) {
Standard_Address aRes;
if (!aThreadArr[i].Wait (aRes))
std::cerr << "Error: Cannot get result of thread " << i << std::endl;
}
return 0;
}

The code of a function accepted by OSD_Thread::SetFunction() will be executed in a separate thread. Running the above example you should see different id's for master and worker threads.

OSD_Thread::Run() instance accepts a parameter of type void* which can be a pointer to some data object or is a type cast from some compatible type (e.g. integer). The same convention applies to an output parameter of OSD_Thread::Wait().

Of course, you are not obliged to use OSD_Threads and may prefer to use other implementations (system threads, Qt, Intel Threading Building Blocks, Open MP, etc). I have become a big fan of Intel TBB and am using it in CAD Exchanger.

Synchronization objects
Currently Open CASCADE provides the only synchronization object called Standard_Mutex which corresponds to Critical Section on Windows. You should get some knowledge on this subject to effectively use synchronization in your code.

For instance, to minimize contention you should protect the minimum required section of the code with a synchronization object. Using Standard_Mutex::Sentry you may create an additional nested scope for that:

void MyClass::MyMethodThatCanRunConcurrently()
{
//thread-safe code
...
{
//code requiring serialization
Standard_Mutex::Sentry aSentry (myMutex);
MyNotConcurrentMethod();
}

//thread-safe code again
...
}

The most convenient and efficient use of Standard_Mutex is through Standard_Mutex::Sentry which implements a scoped object which acquire and releases locks on a mutex. The Sentry object locks the mutex during its construction and unlocks it during destruction. This guarantees that the mutex will be unlocked even in the case of an exception raised during execution. Otherwise this could create a deadlock – other threads continue to wait for the locked mutex while there is no one who could unlock it.

It must be noted that Standard_Mutex is currently implemented inefficiently and therefore can hardly be recommended for use as is. Its method Standard_Mutex::Lock() is uses repetitive calls to TryEnterCriticalSection() and sleep(). This may result in wasting processor resources instead of yielding them to another task. It seems that developers attempted to implement a spin lock which can be useful for very short critical sections (where it is more efficient to spin than to immediately go into wait mode). However this must have been done differently – using critical section with spin count. I had to patched Standard_Mutex to eliminate this TryEnter...() behavior.

During my earlier experiments I have implemented other objects – wait condition and semaphore – but as they did not make part of Open CASCADE, there is no point describing them here.

(to be continued...)

2009-06-29

Developing parallel applications with Open CASCADE. Part 3

( continued...)

Data races
Data races is one of the most frequent problems that pop up when transitioning from sequential to parallel programming. When you develop a sequential code you usually make totally different assumptions on how you code would execute. It may require some imagination to understand how your code may work in multi-threaded environment and to anticipate which problems may happen.

Data races may result in really unpredictable application behavior – from sometimes correct results to non-reproducible incorrect results, and to spontaneous crashes. If you observe such behavior it may likely be due to a data race in your code or in 3rd party library. To catch it you can use Intel Parallel Inspector which is able to identify different memory and thread errors. Here is how it looks when catching an error in Open CASCADE:



Open CASCADE contains quite a lot of data races prone places if used in parallel applications as is. I first mentioned about data races in OpenCASCADE when experimented with IGES translation last year (search for forum threads on that). Working on ACIS importer revealed a few more. Most frequent examples are the following:

a). return of const& to a static variable inside the function. For instance:
const gp_Dir& gp::DX()
{
static gp_Dir gp_DX(1,0,0);
return gp_DX;
}

This works perfectly fine in a single-threaded application but may create a problem in a parallel one. When two threads simultaneously call gp::DX() for the first time, there is concurrent write-access to gp_DX variable and result is unpredictable. The fix in this case is to put definition outside of the function and thus to initialize the variable during load time:

static gp_Dir gp_DX(1,0,0);
const gp_Dir& gp::DX()
{
return gp_DX;
}

b). Use of static variables defined in a file.
These cases range from forced use when a variable cannot be part of an argument list or a class member to ugly usages that I can only attribute to non-understanding of C++. Fixes for these cases were symmetrical – from using Standard_Mutex whenever it was unavoidable to creating auto variables on a stack.

The example of the first case was GeomConvert_ApproxCurve.cxx which defines a C-style function that must use data outside of it (e.g. as a member of a class that calls it). The API of this function is prescribed elsewhere and itself is specified as an argument to some class.

static Handle(Adaptor3d_HCurve) fonct = NULL;
static Standard_Real StartEndSav[2];

extern "C" void myEval3d(Standard_Integer * Dimension,
// Dimension
Standard_Real * StartEnd,
// StartEnd[2]
Standard_Real * Param,
// Parameter at which evaluation
Standard_Integer * Order,
// Derivative Request
Standard_Real * Result,
// Result[Dimension]
Standard_Integer * ErrorCode)
// Error Code
{
...
StartEndSav[0]=StartEnd[0];
StartEndSav[1]=StartEnd[1];
...
}

In this case I had to protect fonct and StartEndSav with Standard_Mutex in the caller class to prevent data race. Reading Open CASCADE 6.3.1 Release Notes, I noticed that this issue was addressed and the OCC team made deeper changes introducing a class (not a C function) what allows to handle data transfer in a way not involving synchronization. This is good and I will be able to roll back my fixes.

c). Special case of BSplCLib, BSplSLib and PLib.

This case was already discussed in the blog last December in context of speeding up Boolean Operations. Let me briefly remind you the issue. For B-Splines calculations, version 6.3.0 (and prior) uses an auxiliary array of doubles allocated on heap; it is reallocated if new requested size is greater than previously allocated or is simply reused if already allocated buffer suffices. This is done in order to avoid frequent allocation/deallocation as this creates a bottleneck but obviously is not thread-safe.

So discussing then, Andrey Betenev suggested using auto variables (allocated on stack) as there is a limitation of a maximum B-Spline degree. Digging the code I realized that though in 99+% cases this will work, there can still be cases when any maximum threshold can be exceeded. This is because the number of poles in B-Splines is unlimited (and there are curves featuring more than 8192 poles!). So I implemented an improved alternative that use auto-buffers if the requested size is less than some threshold and uses heap otherwise.

template class BSplCLib_DataBuffer
{
public:

//8K * sizeof (double) = 64K
static const size_t MAX_ARRAY_SIZE = 8192;

BSplCLib_DataBuffer (const size_t theSize) : myHeap (0), myP (myAuto)
{
Allocate(theSize);
}

BSplCLib_DataBuffer () : myHeap (0), myP (myAuto) {}

virtual ~BSplCLib_DataBuffer()
{ Deallocate(); }

void Allocate (const size_t theSize)
{
Deallocate();
if (theSize > MAX_ARRAY_SIZE)
myP = myHeap = new T [theSize];
else
myP = myAuto;
}

operator T* () const
{ return myP; }


protected:

BSplCLib_DataBuffer (const BSplCLib_DataBuffer&) : myHeap (0), myP (myAuto) {}
BSplCLib_DataBuffer& operator= (const BSplCLib_DataBuffer&) {}

void Deallocate()
{
if (myHeap) {
delete myHeap;
myHeap = myP = 0;
}
}

T myAuto [MAX_ARRAY_SIZE];
T* myHeap;
T* myP;
};

(to be continued...)

2009-06-23

Developing parallel applications with Open CASCADE. Part 2

(continued...)
By the way, you must set this environment variable (MMGT_OPT) as well as other ones that control Open CASCADE memory management, before your application starts. This is important as selection of a memory manager is done during dll's loading. This is also a very inconvenient limitation that forbids to assign a custom memory manager in run-time.

Due to extensive memory allocation and deallocation during app life-cycle, it may become a hotspot. Look for instance, at the following screenshot received on a CAD Exchanger ACIS translator.



When measuring concurrency and waits & locks (the other two analysis types offered by Amplifier) using direct OS memory manager (as Open CASCADE one could not be used as explained in a previous post), I noticed that it also causes the greatest wait time.



On one hand, this is a very good indication that the rest of the code runs pretty fast. On the other, it indicates that memory management can really become a bottleneck. Analyzing the problem deeper I switched to the mode to see direct OS functions (toggling off the button on Amplifier toolbar) and here is what I saw:



What does it show to us ? That 2 system functions – RtlpFindAndCommitPages() and ZwWaitForSingleObject() – are hotspots and stack unwinding shows that they are called from memory allocation / deallocation routines !

The first hotspot is explained by the fact that ACIS translator creates multiple tiny objects containing results of translation (on this particular test case – 22000+). This causes strong memory fragmentation which forces the system to constantly look for new memory chunks.

The second (which goes through critical section) is caused by the default mechanism of memory management on Windows. As you might know, heap allocation in Windows is done sequentially using a mutex (critical section) with spin count of ~4000, i.e. when one thread requests memory allocation and in parallel another one tries to do the same, this latter thread does not go immediately to sleep mode but 'spins for 4000 times letting the former one to complete.

All this is caused by the direct use of calloc/malloc/free, and new/delete. To overcome this issue I have tried a technique offered by Intel Threading Building Blocks 2.2 which allows to substitute all calls to C/C++ memory management with calls to tbb allocator. This is done extremely easy with including a simple statement

#include "tbb/tbbmalloc_proxy.h"

The TBB allocator runs concurrently (without locks inside) and also works in a way similar to Open CASCADE – reuses once allocated blocks without returning them to the OS. This solved both hotspots issue and gave additional speed! Check out comparison of results:



On a side note, notice that on previous images there are not only calls via Standard_MMgrRaw::Allocate() and ::Free() (which are called via Standard::Allocate() and ::Free() when MMGT_OPT=0). There are also direct calls from arrays (TColgp_Array1OfPnt2d, etc) and others (highlighted in violet). They correspond to calls of new[] operator which is not redefined in Open CASCADE classes and therefore bypass Open CASCADE memory manager. Andrey Betenev once pointed this out to me, and here are the clear evidence of this observation. So, this should lead to an action on OCC team side, if they want to fix this oversight.

So, the summary are:
- default (optimized) Open CASCADE memory manager (MMGT_OPT=1) is not applicable to the parallel apps
- raw memory manager (MMGT_OP=0) forwarding calls to OS malloc/free can lead to fragementations and bottlenecks on workloads extensively using memory;
- to overcome this you need special memory manager, for instance Intel TBB that substitutes OS memory management routines.

(to be continued...)

2009-06-20

Performance tests of CAD Exchanger ACIS translator

As I already mentioned, I have been developing an ACIS translator for CAD Exchanger designing its architecture for multi-core from the very beginning. Today I performed a series of performance tests of its Alpha version executing multiple-files (from 150Kb to 55Mb) on different hardware (2/4/8 cores). The same set of tests have been executed against another ACIS translator and I am proud to state that mine has fully outperformed it ! The lowest gain was about 15% on a smaller file and the highest gain was 10x on the biggest one (it took CAD Exchanger ACIS translator only 23 seconds while the competitor needed 230seconds to translate a model of 55Mb). Obviously the latter gain will be most noticeable for the user.

I made some video running a live test of one model to demonstrate parallelism on an Intel Core i7 workstation. Sorry for sound artifacts – I was simply using my laptop for video recording, so you may notice hard disk crunching :-(.



There are several phases in the ACIS translator – parsing a file, creating a transient file model in memory, converting it into Open CASCADE shape, and eventually shape analysis and correction. Conversion and correction which take the largest share of time are fully concurrent and scale very well in most cases. Parsing a file is obviously a single-threaded task. And a phase of creating a transient model, even though architecture allows to make it fully parallel I had to execute sequentially being a hostage of Microsoft STL bug mentioned in an earlier post. With growing number of cores this phase becomes considerable comparing to others thus making the whole translation worse scalable. Hope that Microsoft will fix it earlier than I decide to implement own string parsing ;-).

The code takes advantage of Intel Threading Building Blocks which I loved for its ease of use, excellent scalability and very low overhead (I did not notice much difference running TBB with only one thread vs simple sequential execution). There are also a couple of know-how architectural algorithms that made this concurrency possible, which I am quite proud of;-).

Development of the ACIS translator will continue and should make it into the CAD Exchanger GUI eventually. It will also be offered as a C++ software library that you could use in your Open CASCADE-based application to read ACIS-SAT files. By the way, I was impressed to find hundreds of ACIS files all over the internet which really underlines its popularity.
If you would like to evaluate it please let me know.

Thanks !

2009-06-09

Developing parallel applications with Open CASCADE. Part 1

As mentioned in an earlier post, I am now working on an ACIS importer for CAD Exchanger and am developing it to be parallel. The results are very promising so far (except STL streams parsing due to the Microsoft bug mentioned in that post, though I'm now installing VS2008SP1 to check if it has been fixed). So I decided to share my experience in this blog and hope it will be useful for other developers (questions on the forum about parallelism appear more frequently now). Also hope that Open CASCADE team will benefit from these findings.

As a quick note already made several time, let me underline that in multi-core era parallel applications will become mainstream in some mid-term and you better prepare yourself for that trend now. This can be good for your career path and these skills can be among your competitive advantages. Recently released Intel Parallel Studio eases work of developers to develop and to debug multi-threaded applications (it has become a vital part of my toolbox). There are good basic books on this subject. I am now reading "Patterns for Parallel Programming" by Timothy Mattson et al (recommended by one my skilled colleague), an analogue of famous "Design Patterns" by Erich Gamma (the must read book for any professional software developer). It helped me shape the architecture of the ACIS importer for CAD Exchanger.

Getting back to Open CASCADE, I can definitively say that it is quite usable for parallel applications but like any other software library requires precautions.

General comments.
You will get major gain if you look at your problem from a higher level, from the problem domain, and not from a particular algorithm perspective. Identify what can be done concurrently and what requires sequential execution. For instance, in my initial experiments with IGES translation (see here) I intentionally focused on parallelizing translation of components of IGES groups and this worked reasonably well. But overall gain for entire translation was very modest because it was not this part that took most part – it was Shape Healing that ate about 50%-70% of a total time, and it remained sequential. So, until you restructure your algorithms your improvements will be limited. Working on the ACIS reader I had to completely redesign traditional architecture of waterfall approach (when a model is traversed from a root entity down to a leaf) and consequent shape healing. This gave tremendous outcome. The Patterns book gives concrete recommendations that help identify ‘patterns' in your algorithms.

Handles
With 6.2.x Open CASCADE handles (hierarchy of Handle_Standard_Transient subclasses) come with thread-safe implementation for reference counting. To take advantage of that you must either define MMGT_REENTRANT system variable to non-null or call Standard::SetReentrant (Standard_True) before you start using handles across threads. This makes reference counter to rely on atomic increment/decrement (e.g. InterlockedIncrement() on Windows) instead of ++ which can be not atomic.

void Handle(Standard_Transient)::BeginScope()
{
if (entity != UndefinedHandleAddress)
{
if ( Standard::IsReentrant() )
Standard_Atomic_Increment (&entity->count);
else
entity->count++;
}
}


Memory management
By default, Open CASCADE ships with MMGT_OPT variable defined to 1. This forwards all Standard::Allocate() and ::Free() calls to the Open CASCADE memory manager (Standard_MMgrOpt) which optimizes memory allocation mitigating memory fragmentations. (Probably it deserves a separate post to describe insights of the memory manager.)

Standard_MMgrOpt is thread-safe itself and safely regulates simultaneous memory allocation/deallocation requests from several threads. However it is based on Standard_Mutex (more on it in the future) which in its current implementation introduces extreme overhead and makes this optimized memory management worthless in parallel applications (while in a single threaded environment it works just fine).

So, to overcome this deficiency you should use MMGT_OPT=0. It activates Standard_MMgrRaw that simply forward calls to malloc/free…

(to be continued...)

2009-05-26

Intel Parallel Studio launched !

Today is THE day! Intel announced availability of the Parallel Studio, the suite of software tools to ease development of parallel applications. Work on it has been my best experience at Intel so far. I do adore it !

Respected 3rd party announcements: Dr. Dobbs, SD Times

Parallel Studio home, featuring Open CASCADE team's testimonial:

"Intel® Parallel Inspector and Intel® Parallel Amplifier greatly simplified the task of finding hotspots and memory leaks. We were pleased with the 2X overall performance improvement and the elimination of several previously unidentified memory leaks."
– Vlad Romashko
Software Development Manager
OpenCascade S.A.S.



2009-05-25

Parallelizing C++ streams

This post won't be directly about Open CASCADE but it may be helpful to those who are looking forward to parallelizing their applications. My personal strong belief is that multi-threading is unavoidable for long-term success and one must be prepared to make steps into that direction (better sooner than later). Era of free MHz is over, multi-threading is the only choice to scale in the future.

So. I am now developing an ACIS translator for CAD Exchanger and am designing its architecture to employ multi-threading whenever reasonable. Keeping it efficient, streamlined and light-weight forces to think and to refactor implementation several times but it pays off.

One of the examples where I was looking to integrate multi-threading was a conversion between persistent representation of an ACIS file (barely with a set of containers of ascii strings directly read from the file) into transient representation (which is a set of C++ objects representing ACIS types with data fields and cross-references between each other). The approach is that at the first pass empty transient objects are created and at the second their data is retrieved from persistent objects and references are created using already created placeholders. This allows to process each object fully independently from the others and thus represents an excellent case for multi-threading. First I made a sequential conversion and on the file of 15Mb consisting of 110,000+ entities this mapping took ~3 seconds. This became a benchmark to compete with.

To ease parsing of persistent entities strings I use C++ streams that wrap char* buffers, and operator >> to recognize doubles, integers, characters, etc. The makes the code concise and easy to understand. To enable parallelism I am using Intel Threading Building Blocks, a software library (available under commercial and Open Source license) facilitating solving many frequent tasks met in development of multi-threaded software including parallel loops, concurrent data containers, synchronization objects and so on. It already won several software awards and gains recognition of broad developers audience world-wide. It is developed in the same team that develops Parallel Amplifier, Inspector (and now Advisor) where I currently am at Intel.

The code looked quite straightforward (sequential part is commented out):

/*! \class ApplyPaste
\brief The ApplyPaste class is used for concurrent mapping of persistent representation into transient.

This is a functor class supplied to Intel TBB tbb::parallel_for(). It
uses ACISBase_RMapper to perform conversion on a range of entities that TBB
will feed to it.
*/
class ApplyPaste {
public:

//! Creates an object.
/*! Stores \a theMapper and model entites for faster access.*/
ApplyPaste (const Handle(ACISBase_RMapper)& theMapper) : myMapper (theMapper),
myMap (theMapper->File()->Entities())
{
}

//! Performs mapping on a range of entities
/*! Uses ACISBase_RMapper::Paste() to perform mapping.
The range \a r is determined by TBB.
*/
void operator() (const tbb::blocked_range& r) const
{
const ACISBase_File::EntMap& aPEntMap = myMap;
const Handle(ACISBase_RMapper)& aMapper = myMapper;
Handle (ACISBase_ACISObject) aTarget;
for (size_t i = r.begin(); i != r.end(); ++i) {
const Handle(ACISBase_PEntity)& aSource = aPEntMap(i);
aMapper->Paste (aSource, aTarget);
}
}
private:
const ACISBase_File::EntMap& myMap;
const Handle(ACISBase_RMapper)& myMapper;
};

/*! Uses either consequential or parallel implementation.
*/
void ACISBase_RMapper::Paste()
{
boost::timer aTimer;
Standard_Integer aNbEntities = myFile->NbEntities();

//parallel
tbb::parallel_for(tbb::blocked_range(0,aNbEntities), ApplyPaste(this),
tbb::auto_partitioner());

//sequential
//const ACISBase_File::EntMap& aPEntMap = myFile->Entities();
//Handle (ACISBase_ACISObject) aTarget;
//for (Standard_Integer i = 0; i < aNbEntities; i++) {
// const Handle(ACISBase_PEntity)& aSource = aPEntMap(i);
// Paste (aSource, aTarget);
//}

Standard_Real aSecElapsed = aTimer.elapsed();
cout << "ACISBase_RMapper::Paste() execution elapsed time: " << aSecElapsed << " s" << endl;
}

How disappointed was I to get a new elapsed time of ... 17sec on my new Core 2 Duo laptop (instead of 3secs in sequential code) ! What the heck ?! Obviously it could not be attributed to overhead caused by tbb, otherwise there was no point in using it. But what then ?

I immediately launched Intel Parallel Amplifier to see what goes wrong. Here is what I saw:




Unused CPU time (i.e. when one or more cores were not working) was 33.8s i.e. at least one core did not work. Hotspot tree showed that there was some critical section (a synchronization object that regulates exclusive access to some shared resource) called from std::_Lockit::_Lockit() constructor which itself most of the times was called from std::locale::facet::_Incref() or _Decref(). Mystery at the first glance. So I rebuilt my app in debug mode and started debugger and what ? Everything became clear.

The root cause is a critical section used to protect a common locale object. operator >>() inside creates a basic_istream::sentry object on the stack. Its constructor calls (through another method) ios_base::locale() which returns a std::locale object by calling its copy constructor (see syntax below). The copy constructor calls Incref() to increment a reference counter. Incrementing reference counter is surrounded by a critical section.

locale __CLR_OR_THIS_CALL getloc() const
{ // get locale
return (*_Ploc);
}

So, all streams compete for the same locale object! Moreover, critical section is created with spin count = 0. That means if one thread tries and fails to acquire a lock (enter critical section) while another thread is using it, it immediately goes into a sleep mode. When the lock gets freed the thread gets awaken. But all this is extremely expensive and therefore it creates that very overhead ! Should spin count be a non-null then it would run much faster – the spin count determines amount of tries to acquire a lock before the thread goes to sleep. For example, memory management routines (e.g. malloc()) use spin count of 4000 or so, and this makes multi-threaded apps run effectively even concurrently allocating memory. Why not to do the same for streams ?

OK, I tried to persist and talked to colleagues around. One of them gave me a link to http://msdn.microsoft.com/en-us/library/ms235505.aspx which among the rest discusses thread-specific locale. This looked promising. But after experimenting and reading http://msdn.microsoft.com/en-us/library/ms235302(VS.80).aspx I found this of no help :-(. The matter of the fact is that only C runtime locale can be made local to threads while C++ streams always use global locale::global. Gosh ! Microsoft ! Why ?!

So this is here I am now. I will keep on searching but if you ever dealt with using STL streams in multi-threaded environments or barely heard about this please let me know. I will appreciate any reference. The option to implement an own string parser (to avoid STL streams) is currently the least preferred but eventually is not excluded.

Nonetheless, experience with TBB runs extremely positively. I have integrated it into conversion part which converts transient representation into Open CASCADE shapes. Parallel execution and using of tbb::concurrent_hash_map for storage outperforms sequential implementation and will scale well over larger number of cores. I'll keep you posted on how it's going.

Take care,
Roman

2009-05-20

[off-topic] ¡Hola! – Back from Spain

It has been a long month after my previous post on this blog. As we achieved an RTM (Release-To-Manufacturing) milestone with the Parallel Studio I decided to take some break and we spent a few days in Spain with my family. This was a first time for me to be there and I would definitively like to return. We could not see everything we hoped to – swine flu had adjusted our plans as we arrived there :-(.

In Europe, Spain is known to be affected largely enough by the developing financial crisis. Though I deem the crisis is for good overall, I am very sympathetic to people who might be affected by it. Unlike many optimists out there I believe that much worse is still ahead of us.

What was good to see is that Spain does a lot of things that are a right answer for this downturn. This started with relatively low trip prices which did not seem believable a year ago. To retain tourist traffic the consulate opens a multi-entry Shengen visa valid for 180 days. For a reasonable trip price we got an excellent 3* hotel near Barcelona, one block from the sea, which exceeded some Egyptian 5* hotels. Upon arrival we were given a full boarding though we only paid for HB, and the food was spectacular. As it's not yet a high season (and we normally try to take advantages of this) and preparing for the worse, sellers are very open to offer discounts. As an example, Port Aventura, an entertainment park was offering 50% discount giving 2 days ticket for the price of 1. My daughter was happy to enjoy this.

Consequences of a recent boom are still observed. Like many other countries, Spain fell a victim of a real estate bubble. In a main street of a tourist town where we stayed real estate agencies were met every 100m or even more frequently. More than a half did not show signs of life for all our stay. Some have just been left abandoned. Stopped constructions, even in lovely locations, even almost completed, were frequent. Sales ads on every (!) multi-apartment house along the entire beach, sometimes up to 10+ on each. Natural consequences of craziness. It's good that Spain started it earlier, this is yet to come in Russia.

Regarding sight seeing, we spent most of the days in Barcelona. Of course, (almost) all "must see" locations – The Gothic Castle, Sagrada de Familia, Gaudi buildings, Guel park, Rambla and many other things. We were impressed with St. Maria del Mar church which appeared so small from outside and that monumental inside with a big Rose above the entrance. We could not make the Picasso museum twice, so this remains at least one reason to return. We went another day to Figueres, the home town of Salvador Dali, where he established a museum in the building of a once burnt out theater. What a special man ! His will was to get buried between two lavatory pans and it was fulfilled – his grave is under two working toilets in the museum. As we were walking through the museum I could not say it was too impressive (of course, tiredness added to that) but returning back home I re-thought and concluded that you can't assess a big thing when you are near it, you have to step back to realize it better.

So overall it was a great trip! If there are Spanish readers of this blog – you live in a great country, full of glory history and you have all the rights to be proud of it !

As for me, I returned back to work and we are now full steam ahead for new challenges and projects. I try to find some time to keep on developing CAD Exchanger. There are some interesting findings in that development so I hope to post some of them here.

Take care everyone.
Roman

2009-04-16

All-inclusive bounding box

Working on CAD Exchanger (http://www.cadexchanger.com) I have come across an issue that completely stumble me first. As I started working on mesher performance and gathered initial measurements on a set of sample models, I collected data on default precision (deflection) used to tessellate 3D models. By default, Open CASCADE visualization computes deflection as 0.4% of a maximum dimension of a bounding box. Don't ask me about any logic behind above number (I bet one day it was 1/1000 and then multiplied by 4 ;-)).

Some models were originally IGES or STEP and to speed up bounding box computations I imported them into CAD Exchanger and exported to BRep and used those BRep files. A few days later I decided to extend performance measurements on importers to gather more data on common algorithms. If anyone ever tried to profile Open CASCADE-based algorithms he/she likely noticed BSplCLib::Bohm() among hotspots, so I wanted to measure its total contribution into workload, not just at the phase of mesh computation. What was my surprise when the same models started reporting different deflections and bounding boxes! I could not blame my recent modifications as the same behavior was reproduced on a reference version. What the heck is that then ?!

By some lucky chance revelation came fast enough. As I was experimenting in DRAW, I noticed that bounding box changed (and significantly!) depending if the model had or did not have triangulation inside. So, when I first used brep files exported from CAD Exchanger they did contain tessellation (as they were displayed before export) and when I started using IGES or STEP in my test harness they did not. Wow !

To be sure of the assumption I dove into the source code – indeed, BRepBndLib::Add() has different branches depending if returned triangulation is null or not. That was it! What is noticeable is that bounding boxes built using triangulation are more precise comparing to those built using geometry representations. Certainly that is understandable.

Later the same day I encountered the same symptoms which were somewhat outstanding and made me dig further. Compare two pictures below – a bounding box on geometry representation is about twice as big along the X axis as a box built on tessellation!



Well, 5%-10% which I noticed until then were reasonable but 100% ?! A few minutes with DRAW and debugger root-caused a face and the code in charge. It was a tiny face that generated the excess. Debugging revealed that BndLib_AddSurface::Add() simply extends the bounding box to encompass all the poles of the underlying NURBS surface. Thus, even for a small face on large surface it adds a bounding box of a surface.



This seems to be a trade-off between speed and accuracy. In some cases such a rough approximation probably may even help – for instance, in some intersection algorithms. But it also may have a downside – in visualization. Imagine the following scenario. Create a brep model (from scratch or read from a file without triangulation), try to visualize it, internally it will compute a bounding box to define deflection, bounding box can be significantly larger than the original shape and hence a deflection will be larger. Save it to a brep file (it will store triangulation). Read it back and try to visualize again. This time it will compute bounding box using triangulation and chances are it will be smaller, and deflection will be smaller. Triangulation will therefore be recomputed and visualization mesh will be finer. Save it again and repeat ;-). So if you see examples when you import a brep file, visualize it, save it back and see different file sizes and/or finer display, this can be it!

I will never stop being surprised by Open CASCADE ;-)

2009-04-06

CAD Exchanger Beta is available!

This has been a long awaited moment - my personal project CAD Exchanger is up to being made publicly available! The first Beta is out and the official announcement is below. If anyone has to deal with CAD data translation you might want to use it as a preview or as an additional tool. Please submit your comments, bug reports, suggestions, recommendations, and what not at the user's forum (link below). Thanks!




April 6, 2009.

The CAD Exchanger product team is pleased to announce availability of the first Beta release of a new product to speed up translation of 3D models across multiple CAD file formats.

CAD Exchanger enables importing, viewing, and exporting models from and to IGES, STEP, STL and Open CASCADE BRep formats (more are coming). We hope you will enjoy streamlined interface and ease-of-use of the product.

We are looking forward to users’ feedback that will help us prioritize development activities. Please share your experience at users’ forum at www.cadexchanger.com/forum.

Download a fully free Beta version from www.cadexchanger.com/download.html and start using it today !

Sincerely,
The CAD Exchanger Team

2009-03-28

Unnoticeable memory leaks. Part 2

(Continued...)

As there are multiple objects referencing each other there is no single one to check a reference count of. So I had to choose a slightly different approach which you can reuse in your cases. I have created Standard_Transient* pointers to the objects I wanted to track counts of. Using Handles is non-applicable as assignment increases a ref count and therefore prevents destroying objects. To enforce destruction the handle objects are simply put inside the scope. Here is a code snippet:

Standard_Transient *apWS, *apModel, *apTR, *apTP, *apTW, *apFP, *apG;
{
IGESCAFControl_Reader aReader;
...
apWS = aReader.WS().Access();
apModel = (aReader.WS()->Model()).Access();
apTR = (aReader.WS()->TransferReader()).Access();
apTP = (aReader.WS()->TransferReader()->TransientProcess()).Access();
apTW = (aReader.WS()->TransferWriter()).Access();
apFP = (aReader.WS()->TransferWriter()->FinderProcess()).Access();
apG = (aReader.WS()->HGraph()).Access();
}

The objects' counters while within the scope ranged from 1 to 8 confirming they have been significantly reused inside the Data Exchange framework. When exiting the scope the cascade of objects is destroyed nullifying most references except one – of IGESData_IGESModel, the object that represents a graph of the IGES entities in memory.



This proved suspicions that memory is not fully cleaned upon reading of an IGES file. That is, once you have read an IGES file and started using 3D models contained in it, memory still contains a lot of unused information. Unnoticed memory leak.

Experimenting further I have found a root-cause which was in an extra handle to the IGES model inside the IGESToBRep_Actor class which itself always remained in the IGESControl_Controller which was supposed to reside in memory for the life cycle of the app (it's a registered in a global container). Allocated memory is freed upon next IGES file import and this repeats over and over again.

So, as a work-around I had to add the following line after the IGES file has been read.

Handle(IGESToBRep_Actor)::DownCast (aReader.WS()->TransferReader()->Actor())->SetModel (new IGESData_IGESModel);

Note that specifying a null model is impossible since IGESToBRep_Actor::SetModel() tries to access it assuming its non-null. You may want to add the above line to your code reading IGES.

Fortunately, similar checks in IGES export revealed that no memory is left orphaned. I only checked objects similar to those used in testing IGES import, not sure if there are no any tiny orphans.

Hope these findings will be useful for you and that you can check your code in critical places to make sure you do not produce memory leaks. Good luck !

(end)

P.S. I used the autoexp.dat for Microsoft Visual Studio debugger to facilitate display of the Open CASCADE handles. This was described in an earlier post.

Unnoticeable memory leaks. Part 1

(I have been relatively silent for several last days to respond to comments here on the blog and on the forum. My apologies if this frustrated you. At Intel, we are approaching a release cycle of Intel Parallel Studio and this requires a good deal of efforts to synchronize among multiple teams to ensure that something important is not overlooked. In addition to other activities, this eats much of my time).

This time I'd like to touch a subject what often does not get developers' attention it probably deserves. It's about memory leaks; and not about those which you can easily detect with specialized tools (including Intel Parallel Inspector, part of the Parallel Studio). It's not even about pseudo-leaks that some novice Open CASCADE users blame it of (I hope someday to write more about CASCADE memory management) when freed memory is not returned to the system but remains under the Open CASCADE memory manager. I mean cases when memory is still occupied by some (formally) useful contents but which your application doest not track/need any longer. This may result in orphan data residing in memory and eating out megabytes.

Let me show you a couple of particular examples that could inspire you to review your code for similar cases.

I was recently debugging my app which uses OCAF. There was a document class that contains a handle to OCAF document (TDocStd_Document) and a singleton application class that contains a handle to TDocStd_Application subclass. So once I put a breakpoint into my document destructor to check a reference count of the OCAF document. I expected it to be 1 so that upon destruction of my document, the ref count is decremented to 0 calling a destructor of TDocStd_Document and freeing its memory. To my surprise there a ref count was 3.



Tracing documentation creation revealed 2 additional references – one was from the global session object that registered all open documents and the other was from the TDocStd_Owner attribute that was attached to a root label (this is done automatically by OCAF). Searching around promptly gave an answer – I forgot to call TDocStd_Application::Close() before destroying my document. Once this had been done, document reference count correctly decremented to 0 and started freeing occupied memory.

The point is that until I discovered this manually, there were no any symptoms that document contents (which can be quite memory consuming for a complex structured doc) remained orphaned. The application continued to run and exit gracefully.

Another time I intentionally tried to inspect memory deallocation when importing and exporting IGES. The Data Exchange framework in Open CASCADE is too feature-rich and its architecture involves numerous classes with references between each other. My concern was that there could be cyclic handle cross-references making objects non-destroyable (recall my first posts on Handles about this potential issue). Another suspicion was that some data are not fully cleaned upon after every translation. So I went and used the debugger to see what happens...

(to be continued...)

2009-03-18

Distributing your software

Many software developers often prefer to focus on code design and development. They find writing technical specifications, fixing bugs, composing documentation or making installers too boring for their creative minds. Yes, developing with the Open Source free CAD/CAM/CAE kernel can be fascinating, but productization is a mandatory part of any successful software product.

Let me share with you a relatively easy way how to streamline your installation of your Open CASCADE based application. I will focus on defining your redistributables and setting environment variables.

Dynamic libraries
Here is how you can check which dynamic libraries you need (on Windows):
1. Run Depends.exe which is part of MS Visual Studio installation (e.g. c:\Program Files\Microsoft Visual Studio 8\Common7\Tools\Bin\Depends.exe) and load your executable (or a toolkit). The application will build a dependency tree of dynamic libraries. Pick up those which belong to Open CASCADE (and to other toolkits you might be using).
2. Add TKOpenGl.dll as it’s not directly linked and is loaded at run-time.
3. If you use OCAF, add FWOSPlugin.dll and other dlls corresponding to your formats. For instance, if you use XDE and Xml persistence and do not add your own attributes you need XmlXCAFPlugin.dll and other libraries it depends on. Again, use Depends.exe to unwind dependencies.

Resource files
The number of resource files you will need to ship depends on which modules you use. Here are ones that can be most frequently used:
1. Data Exchange.
Include %CASROOT%\src\SHMessage\SHAPE.<>, %CASROOT%\src\XSMessage\IGES.<>, XSTEP.<> where <> is either us or fr depending on which you intend to use. <> is controlled by the CSF_LANGUAGE variable and defaults to us.
Add %CASROOT%\src\XSTEPResource\IGES, STEP. You will need to set system variables CSF_SHMessage, CSF_XSMessage, CSF_IGESDefaults, CSF_STEPDefaults to refer to directory(ies) where the files are installed.
2. OCAF
You will need a resource file named <> and a variable CSF_ <> Defaults with contents describing a format, file extensions, and identifiers (GUIDs) of storage and retrieval document drivers. XDE applications may directly use the XCAF file from the StdResource directory.
Then you will need a file named Plugin and a CSF_PluginDefaults variable. The Plugin file must contain a line to refer to FWOSPlugin.dll and other drivers if your application maintains store/retrieve operations.
If you use Xml storage format, you will need %CASROOT%\src\XmlOcafResource and the CSF_XmlOcafResource variable.

3. Textures
If you plan to use prebuilt textures you will need to ship files from %CASROOT%\src\Textures and to set the CSF_MDTVTexturesDirectory variable.

There are also fonts and units-related files but I don’t remember any use case where they might be needed. If anyone does please reply with the comments.

System variables
You will need to set the following variables:
- CSF_GraphicShr (set to TKOpenGl.dll or a full path where it is located)
- Those mentioned above (if you use respective resource files)
- Optionally MMGT_* that control memory management (e.g. MMGT_OPT=1 activates Open CASCADE memory manager while MMGT_OPT=0 – standard malloc/free).

Setting environment variables can be done via a bat file which can be created to use some single variable that is set according to the user environment, and all other variables are set relative to it. I am now using another way – setting all variables inside the application itself. This makes it more robust. The following code is executed before main():

//set environment vars to not depend on pre-set environment
OSD_Environment ("CSF_LANGUAGE", "us").Build();
OSD_Environment ("CSF_GraphicShr", "TKOpenGl.dll").Build();

const int aMaxPath = 256;
TCHAR szPath[aMaxPath];
char aPath [aMaxPath];
DWORD aRes = GetModuleFileName (NULL, szPath, aMaxPath);
QOLib_ASSERT (aRes);
char *p = aPath, *q = (char*)szPath;
while (*p++ = *q++) {
q++;
}
TCollection_AsciiString aCasRootDir (aPath);
aCasRootDir.Trunc (aCasRootDir.SearchFromEnd ("\\") - 1);
aCasRootDir.Trunc (aCasRootDir.SearchFromEnd ("\\") - 1);
OSD_Environment ("CASROOT", aCasRootDir).Build();

TCollection_AsciiString aResDir = aCasRootDir + "/resources";
OSD_Environment ("CSF_SHMessage", aResDir).Build();
OSD_Environment ("CSF_XSMessage", aResDir).Build();
...

The only limitation is that MMGT_* cannot be set this way, as they are also verified before the main() function. This does not bother me as I’m using default values, likely you won’t be affected either.

The approach above allows you to launch your executable from any location (Start menu, its working directory, etc) without a headache of prior tweaking a user’s environment.

After including required libraries and resource files, and setting environment variables, you should be ready to go. Good luck!