2009-08-07

const Handle& vs Handle

A quick post. The issue addressed here may seem quite obvious to professional C++ developers, yet it is often overlooked.

Case 1. When you need to downcast a handle to a superclass into a handle to a subclass, chances are you follow the Open CASCADE convention:

Handle(Geom_Curve) aCurve = ...;
...
Handle(Geom_Line) aLine = Handle(Geom_Line)::DownCast (aCurve);
gp_Ax1 anAx1 = aLine->Position();
...

Case 2. When you have a function returning a const Handle&, you likely write:

const Handle(MyClass)& GetMyClassInstance();

Handle(MyClass) anInstance = GetMyClassInstance();
//though you could write const Handle(MyClass)& anInstance

Both are totally valid code. But have you ever thought about what happens inside? Remember that a Handle is not a simple type (like a raw pointer); there is a certain overhead inside. This overhead comes from maintaining a reference counter (incrementing and decrementing it) every time you copy a handle. If you step through such code in a debugger you will go through Handle_Standard_Transient::BeginScope(), EndScope(), the copy constructor, the assignment operator, etc.
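To make this concrete, here is a minimal self-contained sketch. TransientStub and HandleStub are toy stand-ins for the real Open CASCADE classes (all names and the ScopeCalls counter are hypothetical instrumentation, not the OCCT API): copying a handle goes through the BeginScope()/EndScope() pair, while binding a const reference costs nothing.

```cpp
#include <cassert>

// Toy stand-in for Standard_Transient (illustration only, not real OCCT).
struct TransientStub
{
  int RefCount   = 0;
  int ScopeCalls = 0; // counts every BeginScope()/EndScope() invocation
};

// Toy stand-in for Handle: every copy and destruction touches the counter,
// mimicking Handle_Standard_Transient::BeginScope()/EndScope().
class HandleStub
{
public:
  explicit HandleStub (TransientStub* theEntity) : myEntity (theEntity) { BeginScope(); }
  HandleStub (const HandleStub& theOther) : myEntity (theOther.myEntity) { BeginScope(); }
  ~HandleStub() { EndScope(); }

private:
  void BeginScope() { if (myEntity) { ++myEntity->RefCount; ++myEntity->ScopeCalls; } }
  void EndScope()   { if (myEntity) { --myEntity->RefCount; ++myEntity->ScopeCalls; } }
  TransientStub* myEntity;
};

// A copy costs one BeginScope() and one EndScope().
int ScopeCallsForCopy (const HandleStub& theHandle, TransientStub& theEntity)
{
  int aBefore = theEntity.ScopeCalls;
  { HandleStub aCopy = theHandle; (void)aCopy; } // copy ctor + dtor
  return theEntity.ScopeCalls - aBefore;         // 2
}

// A const reference costs no counter traffic at all.
int ScopeCallsForRef (const HandleStub& theHandle, TransientStub& theEntity)
{
  int aBefore = theEntity.ScopeCalls;
  { const HandleStub& aRef = theHandle; (void)aRef; } // plain binding
  return theEntity.ScopeCalls - aBefore;              // 0
}
```

In the real library those scope calls are out-of-line (see Handle_Standard_Transient.cxx), which is exactly why they can show up in profiles.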

This overhead is quite negligible when you deal with a few (hundreds?) objects. However, it may become noticeable when you are making performance-critical computations or deal with many thousands of objects. For instance, I noticed this when translating huge ACIS-SAT files with CAD Exchanger. Surprisingly, BeginScope() and EndScope() appeared among the top 5 hotspots.

The bad news is that such an issue may often be hidden behind other types, not necessarily handles themselves. The most frequent case, with which Open CASCADE itself is vastly contaminated, is TopoDS_Shape. As you remember, TopoDS_Shape contains a field myTShape of the Handle(TopoDS_TShape) type (a handle descending from Handle_Standard_Transient). So whenever you use something like:

1). TopoDS_Edge anEdge = TopoDS::Edge (aShape);
//instead of const TopoDS_Edge& anEdge = TopoDS::Edge (aShape);

or

2). const TopoDS_Shape& GetShape();

TopoDS_Shape aShape = GetShape();

//instead of const TopoDS_Shape& aShape = GetShape();

or

3). have a class returning TopoDS_Shape instead of const TopoDS_Shape& while it could return the latter (e.g. when returning its own field):

class MyClass
{
public:
  ...
  TopoDS_Shape Child() const { return myChild; }
  //instead of const TopoDS_Shape& Child() const { return myChild; }

private:
  TopoDS_Shape myChild;
};

you always incur the penalty of copy constructors and reference counting. Beware and pay attention to that!

Here are some quick recommendations on how to avoid this overhead:
  • a). use const Handle()& (or any other type) as a return type whenever possible;
  • b). use const& for local variables whenever possible;
  • c). substitute Handle(MyClass)::DownCast() with a direct cast (but only if you are certain of the dynamic type!):
Handle(Standard_Transient) aTarget = ...;
const Handle(MyClass)& aMyClass = *static_cast<const Handle(MyClass)*> (&aTarget);

We touched on c) in the very first post here. I'm currently thinking of extending the Handle class with such a method casting to a const Handle&. Thus, Handle would offer both kinds of cast, mirroring the two C++ operators dynamic_cast and static_cast.
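As a sketch of what such a method could look like, here are placeholder classes (BaseHandle and LineHandle stand in for Handle(Standard_Transient) and a derived handle; the static ConstDownCast() method itself is hypothetical): it performs the unchecked static_cast of recommendation c) and returns a const reference, so no reference counting is involved.

```cpp
#include <cassert>

// Placeholder for Handle(Standard_Transient); old-style handles are
// polymorphic, hence the virtual destructor.
class BaseHandle
{
public:
  virtual ~BaseHandle() {}
};

// Placeholder for a concrete handle such as Handle(Geom_Line).
class LineHandle : public BaseHandle
{
public:
  LineHandle() : myTag (42) {}
  int Tag() const { return myTag; }

  // The proposed companion to DownCast(): no dynamic type check,
  // no reference-counter traffic - just an unchecked reinterpretation.
  static const LineHandle& ConstDownCast (const BaseHandle& theOther)
  {
    return *static_cast<const LineHandle*> (&theOther);
  }

private:
  int myTag;
};
```

As with recommendation c), this is only safe when the caller is certain of the actual dynamic type.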

Any thoughts on this?

5 comments:

  1. Copy constructors for smart pointers (I have experience mostly with Boost) keep 2 integers (reference count and weak reference count) and the pointer. The copy on a 64-bit machine is 16 bytes (as smart pointer classes have no virtual methods, so no vtable).

    Instead of DownCast I would still use dynamic_cast when needed, as:
    Handle(Standard_Transient) aTarget;
    const Handle(MyClass)& aMyClass = *dynamic_cast<const Handle(MyClass)*> (&aTarget);

    There is overhead but it is still small: http://archives.devshed.com/forums/kde-96/dynamic-cast-performance-overhead-821502.html
    It is about 1ms at 2k objects on an older machine.

    I think, as you have all the tools, putting const & anywhere possible is a good practice, as you pointed out, but it will gain much less performance than fixing a bad algorithm or bad locking (as you already showed).

    The best way to achieve more, I think, is to use profile-guided optimization: http://msdn.microsoft.com/en-us/library/aa289170%28VS.71%29.aspx (Visual Studio) or http://jasondclinton.livejournal.com/70872.html (gcc). I don't know how to set it up on Intel's command line, but I know that PGO is a standard feature in compilers.

    Here is a review showing that the speed increase is about 10% using PGO: http://cboard.cprogramming.com/tech-board/111902-pgo-amazing.html#post832951

  2. Hi ciplogic,

    As I'm having quite a time-consuming OCC problem, I decided to have a look at the techniques you described.

    I read the article on VS, recompiled OCC using Whole Program Optimization with Link-Time Code Generation, and then recompiled my app just to see what happens. Unfortunately, the results were not very promising: a 1-2% speed-up... not much. Maybe PGO will bring more.

    Pawel

  3. 1-2% seems too little to me. It is like a rounding error of the benchmark. I think you need PGO to train the compiler with the biggest files you have (so that the compiler sees hotspots it can optimize), or your CPU-side improvement may be bigger but you are still blocked in other operations like multi-threaded locking or I/O.

    In my experience, you should get at least a 5% improvement; around 10% is normal, and 15%-25% if you are lucky (meaning the code is highly mathematical and the compiler can optimize a lot of blocks, removing parts that are useless).

    As for const &, if you want to use "esoteric" optimizations, you may move most code into headers (to get more inlining) and pick a faster calling convention (like __fastcall), but those optimizations are not good in the long term, and they may not be compatible with other applications that expect the old ABI. The short-term win may become a maintenance hell when you are hit by bugs and don't know which kind of data your reference holds.

  4. Hi ciplogic,

    Thanks for sharing your thoughts and educating me on PGO. I have never used it myself though. Do you have an idea how sensitive it is to the sample workloads the instrumented code is trained with? As I suspect, and as the MSDN article confirms, non-representative scenarios can make the code run slower than without PGO. Testing CAD Exchanger, I try to use different models - from medium to large - to measure performance. Normally these represent an average model in each class, but chances are my users may use something different.

    On the copy constructor and const Handle& - my point was not about the overhead of the size involved whatsoever but about the code which is executed. And this code is *not* inlined! See Handle_Standard_Transient.cxx.

    The size itself is of less concern, of course, but everyone still needs to remember that a Handle has the size of 2 pointers (vtbl + Standard_Transient*).

    Yes, a good algorithm is what will give you the biggest gain. However, when you have to deal with legacy or 3rd-party code, some tweaks and hints (like the PGO you mention) can be of good help. So thanks again for referring to this technique!

    Roman

  5. Given that Handle is based on Standard_Transient (you are completely right!), the Handle overhead is big (not to say huge!). With Boost's smart pointer implementations, accessing them was largely as cheap as a pointer dereference. There is no virtual call, and it is pretty well optimized at the assembly level depending on your machine architecture. The saving was there, and by code style we were required to use const & for all classes. But in practice, even in OpenGL code, const & was a minimal win in the case of Boost's smart pointers in 64-bit code (const & was an 8-byte push on the stack compared with a 16-byte push, plus two atomic increment and decrement operations).

    So you are right: the Handle's overhead may get much bigger because of vtables and virtual calls. In the case of Boost, the smart pointers were plain templates which could be inlined, as they live in headers and have no vtable (meaning no out-of-line calls like destructor chains).

    So it was my mistake to consider the Handle's time cost comparable with its Boost counterpart.

    With PGO, at least in GCC's case, functions with high usage counts are inlined more aggressively. It can also split a function into parts: the less-used part may be extracted from the function body, increasing cache locality. (source about GCC: http://www.scribd.com/doc/16080629/GCC-Profile-Guided-Optimization )

    About the speed loss with PGO: it mostly does not happen, because you, as the person who wants to optimize, will mostly pick the big files that gain the most from profiling. Even if the small files get less efficient, say by a factor of 100%, that would be a jump from 1 second to 2 seconds. But if for big files the win is from 60 seconds to 55, the overall win for a batch conversion of files will be greater. Of course, no one will be happy to wait too long.

    Off topic: PGO was done much earlier, in a dynamic manner, in JITs such as the HotSpot JVM, where a profiler dynamically traced how often code lines were executed to decide how much to inline in "hot" areas.
