Computational Performance Art

I am working with a number of people who are convinced that the application we are developing is inherently computationally intensive; and that is why we are not getting performance that some number of other people think we should be getting. In my experience, folks who build a slow application due to design issues tend to blame the hardware, the compiler, the complexity of the application, but only lastly and only after being shown it is the case.
In my case, I am highly suspect of the design, but I have to convince the folks making the decision that it is not a question of hardware.

Theoretical Peak Performance
How much power does a modern quad core Xeon have?
XEON “Clovertown” retires 4 floating point operations [FLOPS] per clock, so a 3GHz “Clovertown” should do 3 billion x 4 FLOPS x 4 cores per second, or 48 GFLOPS theoretical peak with 64-bit precision. That’s 48 billion double precision floating point operations per second.
List_of_Intel_Xeon_microprocessors
Intel Core Architecture

How does this relate in terms of our problem?
Assume much of our work is 3×3 float matrix multiply [MM]. One 3×3 MM is 3 multiplys and 2 adds for each of 9 elements = 27 multiplys and 18 adds, or 55 FLOPs. At peak, “Clovertown” should do 873 million 3×3 MM per second. The number of configurations I’d like to check per second is about 10,000. I should be able to devote 87,000 MM per configuration per second. Based on what I guess of our computational coasts, this seems like plenty of juice.
Even if I only had one core to use in a single thread, I would have 20,000 MM per configuration. This should put to rest our initial desire to use 8 different CPUs and take the added complexity of IPC.

What is the computational cost of our problem?
This question is something I have not been able to answer, nor have anyone answer adequately. I need to find some good docs preferably, or less ideally do some call stack tracing through the code.

Profiling Issues
Presently, profiling with VTune suggests that I have about eight computational domains each taking between 10 and 15 percent of CPU. This is not what one wants for an easy speed fix. I could optimize out any one of these domains and only see a 10% increase in speed.
At one point malloc/free was taking around 30%, and I put Google Perf Tools to work which brought it down to 10%. This was a very nice win. Only took a few days to do and required no change in architecture, just adding the right link line.

Cache Miss City
One of the possible problems that would really bring performance down is cache misses. The current architecture is base on dealing with single datums at a time rather than whole arrays of data. The genesis of these design decisions rests squarely with managements instance on a Model Driven Architectural way of doing things; also possibly from the developers wholesale embrace of object oriented mis-thinking.

Network Bandwidth
One datum at a time is what I suspect to be one of the pillars of our problem. Why is this so costly? Because we kill the cache. We give up both code and data locality. We guarantee that each datum requires at least two context switches.

Switching from one running process to another running process incurs a cost to the system. The values of all the registers must be saved in the present state, the status of all open files must be recorded and the present position in the program must be recorded. Then the contents of the MMU must be stored for the process (see next chapter). Then all those things must be read in for the next process, so that the state of the system is exactly as it was when the scheduler last interrupted the process. This is called a context switch. Context switching is a system overhead. It costs real time and CPU cycles, so we don’t want to context switch too often, or a lot of time will be wasted.

Using Netperf 2.4.2
Let’s do a little digging.

[andrew]$ ./netperf -a 4096 -t STREAM_STREAM

STREAM STREAM TEST

Recv   Send    Send

Socket Socket  Message  Elapsed

Size   Size    Size     Time     Throughput

bytes  bytes   bytes    secs.    10^6bits/sec110592  110592  110592    10.00    20369.26

[andrew]$ ./netperf  -t STREAM_STREAM

STREAM STREAM TEST

Recv   Send    Send

Socket Socket  Message  Elapsed

Size   Size    Size     Time     Throughput

bytes  bytes   bytes    secs.    10^6bits/sec

110592  110592  110592    10.00    16062.94

[andrew]$

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G -n 2  -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    GBytes  /s  %      %       us/KB   us/KB

110592  110592  110592    10.00      2.48   99.90  99.90   0.768   6445907.000

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G -n 2  -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    GBytes  /s  %      %       us/KB   us/KB

110592  110592  110592    10.00         2.48   99.90  99.90   0.768   6445907.000

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G -n 2  -t STREAM_STREAM -- -m 128 -M 128
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    GBytes  /s  %      %       us/KB   us/KB

110592  110592     128    10.00         0.02   99.25  99.25   105.632  886107072.000

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G   -t STREAM_RR
STREAM REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr

110592 110592 1       1      10.00   44428.70   80.85  80.91  36.397  933560.938
110592 110592

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G   -t STREAM_RR -- -r 100,100
STREAM REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr

110592 110592 100     100    10.00   45457.03   80.10  80.20  35.244  904538.250
110592 110592

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G   -t STREAM_RR -- -r 100K,100K
STREAM REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr

110592 110592 100     100    10.00   45404.84   80.20  80.35  35.328  907267.312
110592 110592

bash-3.00$ ./netperf -c -C -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    10^6bits/s  %      %       us/KB   us/KB

110592  110592  110592    10.00      16154.77   97.50  97.50   0.989   8294972.500

FLOPS for 3×3 Inversion

m3 = {{a11,a12,a13},{a21,a22,a23},{a31,a32,a33}}

{{a11,a12,a13},{a21,a22,a23},{a31,a32,a33}}

Inverse[m3]

{{(-a23 a32+a22 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(

        a13 a32-a12 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(-a13 a22+a12 a23)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33)},{(

        a23 a31-a21 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(-a13 a31+a11 a33)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33),(

        a13 a21-a11 a23)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33)},{(-a22 a31+a21 a32)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33),(

        a12 a31-a11 a32)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(-a12 a21+a11 a22)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33)}}

The denominator is common taking 18 multiplys and 5 adds. The numerators take 18 multiplys and 9 adds. Then there are 9 divides for the elements. That is 36 mutiplys, 9 divides, and 5 adds for a total of 50 FLOPS for a 3×3 inversion.

Leave a Reply

You must be logged in to post a comment.