Update Wordpress
Friday, December 28th, 2007Time to update.
done!
Time to update.
done!
I am working with a number of people who are convinced that the application we are developing is inherently computationally intensive; and that is why we are not getting performance that some number of other people think we should be getting. In my experience, folks who build a slow application due to design issues tend to blame the hardware, the compiler, the complexity of the application, but only lastly and only after being shown it is the case.
In my case, I am highly suspect of the design, but I have to convince the folks making the decision that it is not a question of hardware.
Theoretical Peak Performance
How much power does a modern quad core Xeon have?
XEON “Clovertown” retires 4 floating point operations [FLOPS] per clock, so a 3GHz “Clovertown” should do 3 billion x 4 FLOPS x 4 cores per second, or 48 GFLOPS theoretical peak with 64-bit precision. That’s 48 billion double precision floating point operations per second.
List_of_Intel_Xeon_microprocessors
Intel Core Architecture
How does this relate in terms of our problem?
Assume much of our work is 3×3 float matrix multiply [MM]. One 3×3 MM is 3 multiplys and 2 adds for each of 9 elements = 27 multiplys and 18 adds, or 55 FLOPs. At peak, “Clovertown” should do 873 million 3×3 MM per second. The number of configurations I’d like to check per second is about 10,000. I should be able to devote 87,000 MM per configuration per second. Based on what I guess of our computational coasts, this seems like plenty of juice.
Even if I only had one core to use in a single thread, I would have 20,000 MM per configuration. This should put to rest our initial desire to use 8 different CPUs and take the added complexity of IPC.
What is the computational cost of our problem?
This question is something I have not been able to answer, nor have anyone answer adequately. I need to find some good docs preferably, or less ideally do some call stack tracing through the code.
Profiling Issues
Presently, profiling with VTune suggests that I have about eight computational domains each taking between 10 and 15 percent of CPU. This is not what one wants for an easy speed fix. I could optimize out any one of these domains and only see a 10% increase in speed.
At one point malloc/free was taking around 30%, and I put Google Perf Tools to work which brought it down to 10%. This was a very nice win. Only took a few days to do and required no change in architecture, just adding the right link line.
Cache Miss City
One of the possible problems that would really bring performance down is cache misses. The current architecture is base on dealing with single datums at a time rather than whole arrays of data. The genesis of these design decisions rests squarely with managements instance on a Model Driven Architectural way of doing things; also possibly from the developers wholesale embrace of object oriented mis-thinking.
Network Bandwidth
One datum at a time is what I suspect to be one of the pillars of our problem. Why is this so costly? Because we kill the cache. We give up both code and data locality. We guarantee that each datum requires at least two context switches.
Switching from one running process to another running process incurs a cost to the system. The values of all the registers must be saved in the present state, the status of all open files must be recorded and the present position in the program must be recorded. Then the contents of the MMU must be stored for the process (see next chapter). Then all those things must be read in for the next process, so that the state of the system is exactly as it was when the scheduler last interrupted the process. This is called a context switch. Context switching is a system overhead. It costs real time and CPU cycles, so we don’t want to context switch too often, or a lot of time will be wasted.
Using Netperf 2.4.2
Let’s do a little digging.
[andrew]$ ./netperf -a 4096 -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec110592 110592 110592 10.00 20369.26 [andrew]$ ./netperf -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 110592 110592 110592 10.00 16062.94 [andrew]$ bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -n 2 -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. GBytes /s % % us/KB us/KB 110592 110592 110592 10.00 2.48 99.90 99.90 0.768 6445907.000 bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -n 2 -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. GBytes /s % % us/KB us/KB 110592 110592 110592 10.00 2.48 99.90 99.90 0.768 6445907.000 bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -n 2 -t STREAM_STREAM -- -m 128 -M 128 STREAM STREAM TEST Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. GBytes /s % % us/KB us/KB 110592 110592 128 10.00 0.02 99.25 99.25 105.632 886107072.000 bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -t STREAM_RR STREAM REQUEST/RESPONSE TEST Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % % us/Tr us/Tr 110592 110592 1 1 10.00 44428.70 80.85 80.91 36.397 933560.938 110592 110592 bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -t STREAM_RR -- -r 100,100 STREAM REQUEST/RESPONSE TEST Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % % us/Tr us/Tr 110592 110592 100 100 10.00 45457.03 80.10 80.20 35.244 904538.250 110592 110592 bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -t STREAM_RR -- -r 100K,100K STREAM REQUEST/RESPONSE TEST Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % % us/Tr us/Tr 110592 110592 100 100 10.00 45404.84 80.20 80.35 35.328 907267.312 110592 110592
bash-3.00$ ./netperf -c -C -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % % us/KB us/KB 110592 110592 110592 10.00 16154.77 97.50 97.50 0.989 8294972.500
FLOPS for 3×3 Inversion
m3 = {{a11,a12,a13},{a21,a22,a23},{a31,a32,a33}}
{{a11,a12,a13},{a21,a22,a23},{a31,a32,a33}}
Inverse[m3]
{{(-a23 a32+a22 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-
a12 a21 a33+a11 a22 a33),(
a13 a32-a12 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-
a12 a21 a33+a11 a22 a33),(-a13 a22+a12 a23)/(-a13 a22 a31+
a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33)},{(
a23 a31-a21 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-
a12 a21 a33+a11 a22 a33),(-a13 a31+a11 a33)/(-a13 a22 a31+
a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33),(
a13 a21-a11 a23)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-
a12 a21 a33+a11 a22 a33)},{(-a22 a31+a21 a32)/(-a13 a22 a31+
a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33),(
a12 a31-a11 a32)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-
a12 a21 a33+a11 a22 a33),(-a12 a21+a11 a22)/(-a13 a22 a31+
a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33)}}
The denominator is common taking 18 multiplys and 5 adds. The numerators take 18 multiplys and 9 adds. Then there are 9 divides for the elements. That is 36 mutiplys, 9 divides, and 5 adds for a total of 50 FLOPS for a 3×3 inversion.
http://drewk.net/NetBeansProjects.zip
/*
* Writting a KML facility to help with FCS debugging.
*/
package gekml;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
/**
* < ?xml version="1.0" encoding="UTF-8"?>
*
*
*
* at the height of the underlying terrain.
*
*
* @author drew
* @author Andrew Kaluzniacki : drewk.net
* @copyright Andrew Kaluzniacki
*/
public class KmlFile {
private KmlFile() {
// empty
}
public static KmlFile create() {
return new KmlFile();
}
private ArrayList placemarkList = new ArrayList();
public void addPlacemark(double longitude, double latitude, String name, String description) {
placemarkList.add(new Placemark(longitude, latitude, 0, name, description));
}
public void saveAs(String filePath) {
FileOutputStream out = null;
try {
out = new FileOutputStream(filePath);
} catch (FileNotFoundException ex) {
ex.printStackTrace();
throw new IllegalArgumentException(”SaveAs must have valid filePath:” + filePath);
}
PrintWriter pw = new PrintWriter(out);
pw.print(”< ?xml version=\"1.0\" encoding=\"UTF-8\"?>“);
pw.print(”
for (int i = 0; i < placemarkList.size(); i++) {
pw.print(((Placemark) placemarkList.get(i)).toXmlString());
}
pw.print(" “);
if (pw.checkError()) {
System.err.println(”Error in ” + pw.toString());
}
pw.close();
try {
out.close();
} catch (IOException ex) {
ex.printStackTrace();
throw new IllegalStateException(”Error in KmlFile::SaveAs()”);
} finally {
try {
out.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
}
——–
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
* Andrew Kaluzniacki
*
*/
/**
* Two key items:
* Placemarks - used to show aimpoints
* Polygon - used to show the target area
*
* Start:
* If I can just get plaecmarks working, I can fake target areas.
* The key is to get the addPlacemark(location, name)
*
*
* References:
* http://code.google.com/apis/kml/documentation/kml_tut.html#descriptive_html
*
*
*
* < ?xml version="1.0" encoding="UTF-8"?>
*
*
*
* at the height of the underlying terrain.
*
*
*
*
*/
package gekml;
/**
*
* @author drew
*/
public class Main {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
// Create a nice kml file
KmlFile kf = KmlFile.create();
// 32°25′34.65″N
// 106°40′13.68″W
kf.addPlacemark( -106.6253642521367, 32.35612182602554, “AndrewK”, “Is Success”);
kf.saveAs(”my.kml”);
}
}
——-
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
package gekml;
/**
* *
*
* at the height of the underlying terrain.
*
*
* @author Andrew Kaluzniacki : drewk.net
* @copyright Andrew Kaluzniacki
*/
public class Placemark {
private double longitude;
private double latitude;
/** relative to ground - o is at ground level */
private double altitude;
private String name = “Untitled Placemark”;
private String description = “”;
Placemark(double longitude, double latitude, double altitude, String name, String description)
{
this.longitude = longitude;
this.latitude = latitude;
this.altitude = altitude;
if ( name != null ) {
this.name = name;
}
if ( description != null ) {
this.description = description;
}
}
/** Write to stream as xml
*
*/
String toXmlString() {
StringBuffer buf = new StringBuffer();
buf.append(”
buf.append(”
buf.append(”
buf.append(”
buf.append(”
buf.append(latitude).append(’,').append(altitude).append(”
buf.append(”
buf.append(”
return buf.toString();
}
}