(This is a re-post from Intel Software Network)
Continuing to eat our own dog food (i.e., to use Intel tools), let me share some recent practical experience.
I was measuring the parallelized part of the ACIS-SAT converter of CAD Exchanger (my personal project) to investigate its scalability. (SAT is the file format of ACIS, a well-known 3D modeling kernel.) The way to do this was to run the same test case on the same architecture (e.g. Core 2 Duo or Core i7) while changing the number of threads. Ideally, the run time follows a 1/n curve, where n is the number of cores. Below is the data I collected on a few files:
It includes measurements made on the Core-based Xeon 5300 (Clovertown) and the Core i7-based Xeon 3500 (Nehalem). Both are 4 cores with HT (hyper-threading). As you can see, Core i7 outpaces Core (each dashed curve is below the solid one), but the shape of the curve is the same for every file: it scales well from 1 to 2 to 4 threads but then becomes almost constant (except maybe the pink one, which corresponds to a rather small ACIS file).
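To put a number on how far a measured point sits from the ideal 1/n curve, a few small helpers are enough. This is only a sketch: the function names are my own for illustration, and the timings in the note below are made up, not taken from the measurements above.

```cpp
#include <cassert>

// Ideal run time on n cores, given the measured single-thread time t1.
double ideal_time(double t1, int n) {
    assert(n > 0);
    return t1 / n;
}

// Speedup actually achieved when the run with more threads takes tn.
double speedup(double t1, double tn) {
    assert(tn > 0.0);
    return t1 / tn;
}

// Parallel efficiency: 1.0 means the measurement lies on the ideal curve.
double efficiency(double t1, double tn, int n) {
    return speedup(t1, tn) / n;
}
```

For example, with hypothetical timings of 8 s on one thread and 2 s on eight threads, efficiency(8.0, 2.0, 8) is 0.5: eight threads deliver only what four ideal ones would, which is the kind of plateau described above.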
So why is that? When I discussed this issue with my colleagues at Intel, the first guess was that the application consumes all the bus bandwidth trying to load data from memory. That is, although the cores are free to execute, they cannot be fed with data fast enough. We used another experimental Intel tool (Intel Performance Tuning Utility, available for free download at http://whatif.intel.com) to check this. The guess did not prove to be true: the experiments showed that although there are a lot of objects in memory, they are allocated quite locally and can fit into the Level 2 cache, so working with RAM is not a bottleneck.
So what else then? Another guess was HT: perhaps the OS does not distribute the workload optimally among the cores, so that work running in different threads on the same physical core (there are actually 4 physical cores) starts competing for data. So I took another machine with 2 processors (i.e. 8 physical cores) and HT disabled. The data changed slightly: running in 5 threads gave a speedup over 4 threads, but then again the time remained flat when increasing the number of threads from 5 to 8.
To dive deeper into what happens, I launched Intel Thread Profiler, which (along with VTune Performance Analyzer) is a predecessor of Parallel Amplifier but has a nice feature that is not yet available in Amplifier: a timeline.
I focused only on the parallelized part (highlighted in the image below).
As we can see, 7 other threads had already finished their work (dark green became light green) while one thread (in this case, thread 1) continued to work. So this was an obvious example of non-scalability: adding more threads won’t help when only one of them is working!
I run the parallel processing using tbb::parallel_for, which takes a grain size and a partitioner as parameters. They define how the work is split into chunks. To track how TBB does this, I used the profiling API to add marks to the timeline indicating the start and end of processing of each solid, along with its index.
We can see now that this time thread 5 was given the largest solids, with indexes 14 and 15. That is, the default values for the grain size and partitioning strategy resulted in one thread being given several complex solids in a row, which took longer to complete. Given that processing each solid involves multiple complex algorithms, it is safe to split the processing into one solid at a time (the gain outweighs the overhead that TBB incurs for splitting). Here is the new code:
//use simple partitioner and grain size of 1 to ensure individual processing
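A minimal sketch of what such a call could look like is given here; it is not the converter's actual code. `Solid`, `process_solid` and `process_all` are hypothetical stand-ins I made up for illustration. The `tbb::blocked_range` grain size of 1 together with `tbb::simple_partitioner` is the combination the comment above describes. So that the sketch compiles even without TBB installed, it falls back to a plain serial loop when the TBB headers are not found.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Use TBB when its headers are available; otherwise fall back to a serial
// loop so that the sketch still compiles.
#if defined(__has_include)
#  if __has_include(<tbb/parallel_for.h>)
#    include <tbb/blocked_range.h>
#    include <tbb/parallel_for.h>
#    include <tbb/partitioner.h>
#    define SKETCH_HAVE_TBB 1
#  endif
#endif

// Hypothetical stand-in for one ACIS solid; not a CAD Exchanger type.
struct Solid {
    int complexity;
};

// Hypothetical stand-in for the expensive per-solid conversion.
long process_solid(const Solid& solid) {
    assert(solid.complexity >= 0);
    long acc = 0;
    for (int i = 0; i < solid.complexity; ++i) acc += i % 7;
    return acc;
}

// Processes every solid and returns one result per solid, in input order.
std::vector<long> process_all(const std::vector<Solid>& solids) {
    std::vector<long> results(solids.size(), 0);
#ifdef SKETCH_HAVE_TBB
    // Grain size 1 plus simple_partitioner: the range is split all the way
    // down, so each task handles exactly one solid.
    tbb::parallel_for(
        tbb::blocked_range<std::size_t>(0, solids.size(), 1),
        [&](const tbb::blocked_range<std::size_t>& range) {
            for (std::size_t i = range.begin(); i != range.end(); ++i)
                results[i] = process_solid(solids[i]);
        },
        tbb::simple_partitioner());
#else
    for (std::size_t i = 0; i < solids.size(); ++i)
        results[i] = process_solid(solids[i]);
#endif
    return results;
}
```

Each task writes only to its own slot in `results`, so no locking is needed in this sketch; a real converter that shares state between solids would have to take care of that separately.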
This immediately resulted in better performance! While one thread was processing a complex solid, the others were given those that remained. Of course, even in this case the total duration can never be shorter than the processing time of the single longest solid. Moreover, even with such a simplified approach there is still room for improvement: e.g., the workload can be sorted in order of decreasing complexity. This ensures that the ‘heaviest’ solids are distributed and processed first, while the ‘lighter’ solids are deferred, which significantly reduces the chance of imbalance. Of course, this recommendation assumes that sorting is substantially faster than the processing itself.
So, I hope this post will make you think about how your workload is really parallelized, how you can use Intel tools to see this, and how you can manage and improve it.
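The effect of these three strategies can be illustrated with a small, deterministic model. This is a deliberately simplified sketch, not a description of how TBB's scheduler really behaves (TBB also steals work between threads, which is not modeled here). The function names are mine, and the per-solid costs in the note below are invented for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <vector>

// Finish time when each worker gets one contiguous block of solids.
long makespan_contiguous(const std::vector<long>& costs, std::size_t workers) {
    assert(workers > 0);
    const std::size_t chunk = (costs.size() + workers - 1) / workers;
    long longest = 0;
    for (std::size_t begin = 0; begin < costs.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, costs.size());
        long sum = 0;
        for (std::size_t i = begin; i < end; ++i) sum += costs[i];
        longest = std::max(longest, sum);
    }
    return longest;
}

// Finish time when solids are handed out one at a time, in the given order,
// to whichever worker becomes free first.
long makespan_one_at_a_time(const std::vector<long>& costs, std::size_t workers) {
    assert(workers > 0);
    // Min-heap of the times at which each worker becomes free.
    std::priority_queue<long, std::vector<long>, std::greater<long>> free_at;
    for (std::size_t w = 0; w < workers; ++w) free_at.push(0);
    long longest = 0;
    for (long cost : costs) {
        const long finish = free_at.top() + cost;
        free_at.pop();
        free_at.push(finish);
        longest = std::max(longest, finish);
    }
    return longest;
}

// Same as above, but the heaviest solids are handed out first.
long makespan_heaviest_first(std::vector<long> costs, std::size_t workers) {
    std::sort(costs.begin(), costs.end(), std::greater<long>());
    return makespan_one_at_a_time(costs, workers);
}
```

Take 16 solids on 4 workers, where every solid costs 1 unit except the solids with indexes 14 and 15, which cost 8 each. The contiguous split then finishes at 18, one-at-a-time in the original order at 11, and heaviest-first at 8. The last value equals the cost of the single longest solid, which is the lower bound mentioned above.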