I have implemented my own kd-tree library based on "Accelerating kd-tree searches for all k-nearest neighbours" (pdf) and compared its results against ANN to confirm that it produces correct output. Surprisingly, my implementation was several hundred times faster (below graph: for the same queries, ANN gets 14728 profiler samples while my code gets 15). I still wonder whether something is wrong with my code or my profiling, because it seems too fast even after accounting for the fact that my kd-tree is specialized for finding the nearest *one* point and uses Eigen::Vector4f for SSE optimisation.
I tested several sampling profilers: perf, gperftools, OProfile, and CodeXL. It took some time to get a correct call graph. Many profilers output a call graph in which malloc/new has no parent, probably because libstdc++ is not compiled with -fno-omit-frame-pointer, so the caller function cannot be resolved. Newer profilers such as recent perf and gperftools handle this correctly. I couldn't, however, find a way to make perf report's "-G" option work when I took a profile with
perf record --call-graph dwarf ...
so I used gperftools instead (above graph). It still took a while to discover that pprof's 'callgrind' output (for KCachegrind) did not render the result correctly; 'web' worked fine.
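For reference, the gperftools workflow described above looks roughly like this (the binary name and compiler flags here are illustrative, not my exact setup):

```shell
# Keep frame pointers so callers can be resolved, and link the CPU profiler.
g++ -O2 -g -fno-omit-frame-pointer bench.cpp -lprofiler -o bench

# Run with profiling enabled; samples go to bench.prof.
CPUPROFILE=bench.prof ./bench

# Render the call graph in a browser; this is the pprof mode that
# worked for me, while the callgrind output did not.
pprof --web ./bench bench.prof
```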