Archive for December, 2007

Update Wordpress

Friday, December 28th, 2007

Time to update.
done!

Computational Performance Art

Friday, December 28th, 2007

I am working with a number of people who are convinced that the application we are developing is inherently computationally intensive; and that is why we are not getting performance that some number of other people think we should be getting. In my experience, folks who build a slow application due to design issues tend to blame the hardware, the compiler, the complexity of the application, but only lastly and only after being shown it is the case.
In my case, I am highly suspect of the design, but I have to convince the folks making the decision that it is not a question of hardware.

Theoretical Peak Performance
How much power does a modern quad core Xeon have?
XEON “Clovertown” retires 4 floating point operations [FLOPS] per clock, so a 3GHz “Clovertown” should do 3 billion x 4 FLOPS x 4 cores per second, or 48 GFLOPS theoretical peak with 64-bit precision. That’s 48 billion double precision floating point operations per second.
List_of_Intel_Xeon_microprocessors
Intel Core Architecture

How does this relate in terms of our problem?
Assume much of our work is 3×3 float matrix multiply [MM]. One 3×3 MM is 3 multiplys and 2 adds for each of 9 elements = 27 multiplys and 18 adds, or 55 FLOPs. At peak, “Clovertown” should do 873 million 3×3 MM per second. The number of configurations I’d like to check per second is about 10,000. I should be able to devote 87,000 MM per configuration per second. Based on what I guess of our computational coasts, this seems like plenty of juice.
Even if I only had one core to use in a single thread, I would have 20,000 MM per configuration. This should put to rest our initial desire to use 8 different CPUs and take the added complexity of IPC.

What is the computational cost of our problem?
This question is something I have not been able to answer, nor have anyone answer adequately. I need to find some good docs preferably, or less ideally do some call stack tracing through the code.

Profiling Issues
Presently, profiling with VTune suggests that I have about eight computational domains each taking between 10 and 15 percent of CPU. This is not what one wants for an easy speed fix. I could optimize out any one of these domains and only see a 10% increase in speed.
At one point malloc/free was taking around 30%, and I put Google Perf Tools to work which brought it down to 10%. This was a very nice win. Only took a few days to do and required no change in architecture, just adding the right link line.

Cache Miss City
One of the possible problems that would really bring performance down is cache misses. The current architecture is base on dealing with single datums at a time rather than whole arrays of data. The genesis of these design decisions rests squarely with managements instance on a Model Driven Architectural way of doing things; also possibly from the developers wholesale embrace of object oriented mis-thinking.

Network Bandwidth
One datum at a time is what I suspect to be one of the pillars of our problem. Why is this so costly? Because we kill the cache. We give up both code and data locality. We guarantee that each datum requires at least two context switches.

Switching from one running process to another running process incurs a cost to the system. The values of all the registers must be saved in the present state, the status of all open files must be recorded and the present position in the program must be recorded. Then the contents of the MMU must be stored for the process (see next chapter). Then all those things must be read in for the next process, so that the state of the system is exactly as it was when the scheduler last interrupted the process. This is called a context switch. Context switching is a system overhead. It costs real time and CPU cycles, so we don’t want to context switch too often, or a lot of time will be wasted.

Using Netperf 2.4.2
Let’s do a little digging.

[andrew]$ ./netperf -a 4096 -t STREAM_STREAM

STREAM STREAM TEST

Recv   Send    Send

Socket Socket  Message  Elapsed

Size   Size    Size     Time     Throughput

bytes  bytes   bytes    secs.    10^6bits/sec110592  110592  110592    10.00    20369.26

[andrew]$ ./netperf  -t STREAM_STREAM

STREAM STREAM TEST

Recv   Send    Send

Socket Socket  Message  Elapsed

Size   Size    Size     Time     Throughput

bytes  bytes   bytes    secs.    10^6bits/sec

110592  110592  110592    10.00    16062.94

[andrew]$

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G -n 2  -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    GBytes  /s  %      %       us/KB   us/KB

110592  110592  110592    10.00      2.48   99.90  99.90   0.768   6445907.000

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G -n 2  -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    GBytes  /s  %      %       us/KB   us/KB

110592  110592  110592    10.00         2.48   99.90  99.90   0.768   6445907.000

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G -n 2  -t STREAM_STREAM -- -m 128 -M 128
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    GBytes  /s  %      %       us/KB   us/KB

110592  110592     128    10.00         0.02   99.25  99.25   105.632  886107072.000

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G   -t STREAM_RR
STREAM REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr

110592 110592 1       1      10.00   44428.70   80.85  80.91  36.397  933560.938
110592 110592

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G   -t STREAM_RR -- -r 100,100
STREAM REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr

110592 110592 100     100    10.00   45457.03   80.10  80.20  35.244  904538.250
110592 110592

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G   -t STREAM_RR -- -r 100K,100K
STREAM REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr

110592 110592 100     100    10.00   45404.84   80.20  80.35  35.328  907267.312
110592 110592

bash-3.00$ ./netperf -c -C -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    10^6bits/s  %      %       us/KB   us/KB

110592  110592  110592    10.00      16154.77   97.50  97.50   0.989   8294972.500

FLOPS for 3×3 Inversion

m3 = {{a11,a12,a13},{a21,a22,a23},{a31,a32,a33}}

{{a11,a12,a13},{a21,a22,a23},{a31,a32,a33}}

Inverse[m3]

{{(-a23 a32+a22 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(

        a13 a32-a12 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(-a13 a22+a12 a23)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33)},{(

        a23 a31-a21 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(-a13 a31+a11 a33)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33),(

        a13 a21-a11 a23)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33)},{(-a22 a31+a21 a32)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33),(

        a12 a31-a11 a32)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(-a12 a21+a11 a22)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33)}}

The denominator is common taking 18 multiplys and 5 adds. The numerators take 18 multiplys and 9 adds. Then there are 9 divides for the elements. That is 36 mutiplys, 9 divides, and 5 adds for a total of 50 FLOPS for a 3×3 inversion.

KML Goodness

Saturday, December 15th, 2007

http://drewk.net/NetBeansProjects.zip

/*
* Writting a KML facility to help with FCS debugging.
*/
package gekml;

import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;

/**
* < ?xml version="1.0" encoding="UTF-8"?>
*
* * Simple placemark
* Attached to the ground. Intelligently places itself
* at the height of the underlying terrain.

* * -122.0822035425683,37.42228990140251,0
*
*
*

* @author drew
* @author Andrew Kaluzniacki : drewk.net
* @copyright Andrew Kaluzniacki
*/
public class KmlFile {

private KmlFile() {

// empty
}

public static KmlFile create() {
return new KmlFile();
}
private ArrayList placemarkList = new ArrayList();

public void addPlacemark(double longitude, double latitude, String name, String description) {
placemarkList.add(new Placemark(longitude, latitude, 0, name, description));
}

public void saveAs(String filePath) {

FileOutputStream out = null;
try {
out = new FileOutputStream(filePath);
} catch (FileNotFoundException ex) {
ex.printStackTrace();
throw new IllegalArgumentException(”SaveAs must have valid filePath:” + filePath);
}
PrintWriter pw = new PrintWriter(out);

pw.print(”< ?xml version=\"1.0\" encoding=\"UTF-8\"?>“);
pw.print(” “);

for (int i = 0; i < placemarkList.size(); i++) {
pw.print(((Placemark) placemarkList.get(i)).toXmlString());
}

pw.print(" “);

if (pw.checkError()) {
System.err.println(”Error in ” + pw.toString());
}

pw.close();
try {
out.close();
} catch (IOException ex) {
ex.printStackTrace();
throw new IllegalStateException(”Error in KmlFile::SaveAs()”);
} finally {
try {
out.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
}

——–

/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
* Andrew Kaluzniacki
*
*/

/**
* Two key items:
* Placemarks - used to show aimpoints
* Polygon - used to show the target area
*
* Start:
* If I can just get plaecmarks working, I can fake target areas.
* The key is to get the addPlacemark(location, name)
*
*
* References:
* http://code.google.com/apis/kml/documentation/kml_tut.html#descriptive_html
*
*
*
* < ?xml version="1.0" encoding="UTF-8"?>
*
* * Simple placemark
* Attached to the ground. Intelligently places itself
* at the height of the underlying terrain.

* * -122.0822035425683,37.42228990140251,0
*
*
*

*
*
*/
package gekml;

/**
*
* @author drew
*/
public class Main {

/**
* @param args the command line arguments
*/
public static void main(String[] args) {

// Create a nice kml file
KmlFile kf = KmlFile.create();

// 32°25′34.65″N
// 106°40′13.68″W
kf.addPlacemark( -106.6253642521367, 32.35612182602554, “AndrewK”, “Is Success”);
kf.saveAs(”my.kml”);
}
}

——-
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/

package gekml;

/**
* * * Simple placemark
* Attached to the ground. Intelligently places itself
* at the height of the underlying terrain.

* * -122.0822035425683,37.42228990140251,0
*
*
* @author drew
* @author Andrew Kaluzniacki : drewk.net
* @copyright Andrew Kaluzniacki
*/
public class Placemark {

private double longitude;
private double latitude;
/** relative to ground - o is at ground level */
private double altitude;
private String name = “Untitled Placemark”;
private String description = “”;

Placemark(double longitude, double latitude, double altitude, String name, String description)
{
this.longitude = longitude;
this.latitude = latitude;
this.altitude = altitude;
if ( name != null ) {
this.name = name;
}
if ( description != null ) {
this.description = description;
}
}

/** Write to stream as xml
*
*/
String toXmlString() {
StringBuffer buf = new StringBuffer();
buf.append(” \n”);
buf.append(” “).append(name).append(”\n”);
buf.append(” “).append(description).append(”\n”);
buf.append(” \n”);
buf.append(” “).append(longitude).append(’,');
buf.append(latitude).append(’,').append(altitude).append(”
\n”);
buf.append(”
\n”);
buf.append(”
\n”);
return buf.toString();
}
}