Update Wordpress
December 28th, 2007Time to update.
done!
Time to update.
done!
I am working with a number of people who are convinced that the application we are developing is inherently computationally intensive; and that is why we are not getting performance that some number of other people think we should be getting. In my experience, folks who build a slow application due to design issues tend to blame the hardware, the compiler, the complexity of the application, but only lastly and only after being shown it is the case.
In my case, I am highly suspect of the design, but I have to convince the folks making the decision that it is not a question of hardware.
Theoretical Peak Performance
How much power does a modern quad core Xeon have?
XEON “Clovertown” retires 4 floating point operations [FLOPS] per clock, so a 3GHz “Clovertown” should do 3 billion x 4 FLOPS x 4 cores per second, or 48 GFLOPS theoretical peak with 64-bit precision. That’s 48 billion double precision floating point operations per second.
List_of_Intel_Xeon_microprocessors
Intel Core Architecture
How does this relate in terms of our problem?
Assume much of our work is 3×3 float matrix multiply [MM]. One 3×3 MM is 3 multiplys and 2 adds for each of 9 elements = 27 multiplys and 18 adds, or 55 FLOPs. At peak, “Clovertown” should do 873 million 3×3 MM per second. The number of configurations I’d like to check per second is about 10,000. I should be able to devote 87,000 MM per configuration per second. Based on what I guess of our computational coasts, this seems like plenty of juice.
Even if I only had one core to use in a single thread, I would have 20,000 MM per configuration. This should put to rest our initial desire to use 8 different CPUs and take the added complexity of IPC.
What is the computational cost of our problem?
This question is something I have not been able to answer, nor have anyone answer adequately. I need to find some good docs preferably, or less ideally do some call stack tracing through the code.
Profiling Issues
Presently, profiling with VTune suggests that I have about eight computational domains each taking between 10 and 15 percent of CPU. This is not what one wants for an easy speed fix. I could optimize out any one of these domains and only see a 10% increase in speed.
At one point malloc/free was taking around 30%, and I put Google Perf Tools to work which brought it down to 10%. This was a very nice win. Only took a few days to do and required no change in architecture, just adding the right link line.
Cache Miss City
One of the possible problems that would really bring performance down is cache misses. The current architecture is base on dealing with single datums at a time rather than whole arrays of data. The genesis of these design decisions rests squarely with managements instance on a Model Driven Architectural way of doing things; also possibly from the developers wholesale embrace of object oriented mis-thinking.
Network Bandwidth
One datum at a time is what I suspect to be one of the pillars of our problem. Why is this so costly? Because we kill the cache. We give up both code and data locality. We guarantee that each datum requires at least two context switches.
Switching from one running process to another running process incurs a cost to the system. The values of all the registers must be saved in the present state, the status of all open files must be recorded and the present position in the program must be recorded. Then the contents of the MMU must be stored for the process (see next chapter). Then all those things must be read in for the next process, so that the state of the system is exactly as it was when the scheduler last interrupted the process. This is called a context switch. Context switching is a system overhead. It costs real time and CPU cycles, so we don’t want to context switch too often, or a lot of time will be wasted.
Using Netperf 2.4.2
Let’s do a little digging.
[andrew]$ ./netperf -a 4096 -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec110592 110592 110592 10.00 20369.26 [andrew]$ ./netperf -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 110592 110592 110592 10.00 16062.94 [andrew]$ bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -n 2 -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. GBytes /s % % us/KB us/KB 110592 110592 110592 10.00 2.48 99.90 99.90 0.768 6445907.000 bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -n 2 -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. GBytes /s % % us/KB us/KB 110592 110592 110592 10.00 2.48 99.90 99.90 0.768 6445907.000 bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -n 2 -t STREAM_STREAM -- -m 128 -M 128 STREAM STREAM TEST Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. GBytes /s % % us/KB us/KB 110592 110592 128 10.00 0.02 99.25 99.25 105.632 886107072.000 bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -t STREAM_RR STREAM REQUEST/RESPONSE TEST Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % % us/Tr us/Tr 110592 110592 1 1 10.00 44428.70 80.85 80.91 36.397 933560.938 110592 110592 bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -t STREAM_RR -- -r 100,100 STREAM REQUEST/RESPONSE TEST Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % % us/Tr us/Tr 110592 110592 100 100 10.00 45457.03 80.10 80.20 35.244 904538.250 110592 110592 bash-3.00$ ./netperf -a 4090 -A 4096 -c -C -f G -t STREAM_RR -- -r 100K,100K STREAM REQUEST/RESPONSE TEST Local /Remote Socket Size Request Resp. Elapsed Trans. CPU CPU S.dem S.dem Send Recv Size Size Time Rate local remote local remote bytes bytes bytes bytes secs. per sec % % us/Tr us/Tr 110592 110592 100 100 10.00 45404.84 80.20 80.35 35.328 907267.312 110592 110592
bash-3.00$ ./netperf -c -C -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Utilization Service Demand Socket Socket Message Elapsed Send Recv Send Recv Size Size Size Time Throughput local remote local remote bytes bytes bytes secs. 10^6bits/s % % us/KB us/KB 110592 110592 110592 10.00 16154.77 97.50 97.50 0.989 8294972.500
FLOPS for 3×3 Inversion
m3 = {{a11,a12,a13},{a21,a22,a23},{a31,a32,a33}}
{{a11,a12,a13},{a21,a22,a23},{a31,a32,a33}}
Inverse[m3]
{{(-a23 a32+a22 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-
a12 a21 a33+a11 a22 a33),(
a13 a32-a12 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-
a12 a21 a33+a11 a22 a33),(-a13 a22+a12 a23)/(-a13 a22 a31+
a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33)},{(
a23 a31-a21 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-
a12 a21 a33+a11 a22 a33),(-a13 a31+a11 a33)/(-a13 a22 a31+
a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33),(
a13 a21-a11 a23)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-
a12 a21 a33+a11 a22 a33)},{(-a22 a31+a21 a32)/(-a13 a22 a31+
a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33),(
a12 a31-a11 a32)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-
a12 a21 a33+a11 a22 a33),(-a12 a21+a11 a22)/(-a13 a22 a31+
a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33)}}
The denominator is common taking 18 multiplys and 5 adds. The numerators take 18 multiplys and 9 adds. Then there are 9 divides for the elements. That is 36 mutiplys, 9 divides, and 5 adds for a total of 50 FLOPS for a 3×3 inversion.
http://drewk.net/NetBeansProjects.zip
/*
* Writting a KML facility to help with FCS debugging.
*/
package gekml;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import java.util.ArrayList;
/**
* < ?xml version="1.0" encoding="UTF-8"?>
*
*
*
* at the height of the underlying terrain.
*
*
* @author drew
* @author Andrew Kaluzniacki : drewk.net
* @copyright Andrew Kaluzniacki
*/
public class KmlFile {
private KmlFile() {
// empty
}
public static KmlFile create() {
return new KmlFile();
}
private ArrayList placemarkList = new ArrayList();
public void addPlacemark(double longitude, double latitude, String name, String description) {
placemarkList.add(new Placemark(longitude, latitude, 0, name, description));
}
public void saveAs(String filePath) {
FileOutputStream out = null;
try {
out = new FileOutputStream(filePath);
} catch (FileNotFoundException ex) {
ex.printStackTrace();
throw new IllegalArgumentException(”SaveAs must have valid filePath:” + filePath);
}
PrintWriter pw = new PrintWriter(out);
pw.print(”< ?xml version=\"1.0\" encoding=\"UTF-8\"?>“);
pw.print(”
for (int i = 0; i < placemarkList.size(); i++) {
pw.print(((Placemark) placemarkList.get(i)).toXmlString());
}
pw.print(" “);
if (pw.checkError()) {
System.err.println(”Error in ” + pw.toString());
}
pw.close();
try {
out.close();
} catch (IOException ex) {
ex.printStackTrace();
throw new IllegalStateException(”Error in KmlFile::SaveAs()”);
} finally {
try {
out.close();
} catch (IOException ex) {
ex.printStackTrace();
}
}
}
}
——–
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
* Andrew Kaluzniacki
*
*/
/**
* Two key items:
* Placemarks - used to show aimpoints
* Polygon - used to show the target area
*
* Start:
* If I can just get plaecmarks working, I can fake target areas.
* The key is to get the addPlacemark(location, name)
*
*
* References:
* http://code.google.com/apis/kml/documentation/kml_tut.html#descriptive_html
*
*
*
* < ?xml version="1.0" encoding="UTF-8"?>
*
*
*
* at the height of the underlying terrain.
*
*
*
*
*/
package gekml;
/**
*
* @author drew
*/
public class Main {
/**
* @param args the command line arguments
*/
public static void main(String[] args) {
// Create a nice kml file
KmlFile kf = KmlFile.create();
// 32°25′34.65″N
// 106°40′13.68″W
kf.addPlacemark( -106.6253642521367, 32.35612182602554, “AndrewK”, “Is Success”);
kf.saveAs(”my.kml”);
}
}
——-
/*
* To change this template, choose Tools | Templates
* and open the template in the editor.
*/
package gekml;
/**
* *
*
* at the height of the underlying terrain.
*
*
* @author Andrew Kaluzniacki : drewk.net
* @copyright Andrew Kaluzniacki
*/
public class Placemark {
private double longitude;
private double latitude;
/** relative to ground - o is at ground level */
private double altitude;
private String name = “Untitled Placemark”;
private String description = “”;
Placemark(double longitude, double latitude, double altitude, String name, String description)
{
this.longitude = longitude;
this.latitude = latitude;
this.altitude = altitude;
if ( name != null ) {
this.name = name;
}
if ( description != null ) {
this.description = description;
}
}
/** Write to stream as xml
*
*/
String toXmlString() {
StringBuffer buf = new StringBuffer();
buf.append(”
buf.append(”
buf.append(”
buf.append(”
buf.append(”
buf.append(latitude).append(’,').append(altitude).append(”
buf.append(”
buf.append(”
return buf.toString();
}
}
Info on the 15in Studio Display
640×480, 720×400, 800×600, 832×624, 1024×768
Using 50-75 Hz Vertical refresh
Xserver: press ctrl+alt+’+’ to cycle resolutions
I’ve been using the CentOS Linux distribution for almost two years now. CentOS is based on Red Hat Enterprise. The reason I started using it in fact was that we were using RHEL on a project at work and I wanted to start using it at home so that I could become more adept with the OS, but of course I was not willing to pay for the commercial version on my own dime. To be fair though, I recommend RHEL to clients.
CentOS 4 has worked very nicely for me and now I have a chance to set up a new machine and I see that CentOS 5 is out.
Today’s question is what language to use to teach programming to a bright young child? I’ve seen the question before with answers ranging from Java to VB to Squeak.
Slashdot has this MIT Media Lab Making Programming Fun For Kids. Not surprising that there are some strong opinions in the thread.
Looks like Scratch is based on Squeak which in turn is based on Smalltalk. Smalltalk sound about the right speed! I’ll have to look at both and see what I think.
Scratch is perfect for learning to program. I very much like the structuring of control, motion, sense, variable. The visual iconography of control structures wrapping statements is ingenious.
In a few hours we had a pong style game going. It is beautiful. Of course I can also see where some of the limits will occur. I don’t see any programmatic method of sprite creation or destruction.
I’d like to mount my Linux web directories on my Mac. SMB would seem appropriate.I think I have enabled SMB on linux, but I can not seem to get the finder to connect.smb://yyy.yyy.yyy.yyy which is not a valid address gets me nowhere.smb://xxx.xxx.xxx.xxx is valid and I can not connect because name or password is incorrect.smb://xxx.xxx.xxx.xxx/html where html is the share name also fails.SMBClient may be useful for debugging.Ok, it seems to be a firewall/selinux issue. When I turn those off I can connect.System Settings:Security Level:
Something broke. Ok, I needed smb://192.168.1.22/html in the dialog.
One of the most difficult data recovery tasks I’ve done in my career.
Well that went pretty well. A good crowd of 30 of our engineers and a couple folks from management checking in.
The end result is here:
OpenMP Powerpoint Presentation
Reminder to self - prepare a 1 hour presentation on OpenMP. Compare OpenMP to other solutions.
Notes:
(from Intel Thread Building Blocks Book) “For instance, Intel’s Math Kernel Library
(MKL) and Integrated Performance Primitives (IPP) library are implemented
internally using OpenMP. You can freely link a program using Threading Building
Blocks with the Intel MKL or Intel IPP library.?
I am working on a site for Friends of Modern Architecture in Lincoln, which is expected to be low traffic but informational to those who need it. Who to have host it? This was a question I asked a few years ago for this very site. It has to be inexpensive, and I prefer Linux. Godaddy has some pretty nice Linux options at less than $8 per month.
Can I get secure FTP on my godaddy server? Hmmm, not that I can see.