Modern processors and computer systems are designed to be efficient and achieve
high performance with applications that have regular memory access patterns.
For example, dense linear algebra routines can be implemented to achieve near
peak performance. While such routines have traditionally formed the core of
many scientific and engineering applications, commercial workloads like database
and web servers, or decision support systems (data warehouses and data mining)
are among the fastest-growing market segments on high-performance computing
platforms. Many of these commercial applications are characterised by more complex
codes and irregular memory access patterns, which often reduce the performance
that can be achieved. Due to their complexity and the lack of source
code, performance analysis of commercial applications is not an easy task. Hardware
performance counters allow detailed analysis of program behaviour, such as the number
of instructions of various types, memory and cache accesses, hit and miss rates,
or branch mispredictions. In this paper we describe experiments conducted with
various KDD applications on an UltraSPARC-III platform, present the results,
and compare these applications with an optimised dense matrix-matrix multiplication.
We focus on compiler optimisations using the -fast option and discuss differences
between unoptimised and optimised codes.