Archive for the ‘Software Creation’ Category

Using OpenMP at work

Sunday, June 1st, 2008

One of the developers in another group saw my OpenMP presentation and decided it would fit into the design of his current project. They are using an eight core machine to do spectral analysis and need to run a dozen or so models against the data. With much less work than it would take to multi-thread by hand, in fact it may only take a day or two including testing, we expect to get at least a factor of five improved run time performance. The hardest part may be convncing the client the the use of OpenMP is not too esoteric.

Hello Cow : my first iPhone App

Thursday, March 20th, 2008

I was lucky enough to get in on the iPhone early SDK list and I’ve written my first simple application; HelloCow! Similar in concept and complexity to Hello World, the application consists of a picture of a cow. When you touch the cow, she moos. My little boy loves cows and he loves my iPhone, so I’m hoping this keeps him entertained without letting him dial 911 accidentally.  Hello Cow The permanent page is Hello Cow.

The image of the cow is graciously provided by Jordan McClements who has many beautiful images at jmcwd.com.

Computational Performance Art

Friday, December 28th, 2007

I am working with a number of people who are convinced that the application we are developing is inherently computationally intensive; and that is why we are not getting performance that some number of other people think we should be getting. In my experience, folks who build a slow application due to design issues tend to blame the hardware, the compiler, the complexity of the application, but only lastly and only after being shown it is the case.
In my case, I am highly suspect of the design, but I have to convince the folks making the decision that it is not a question of hardware.

Theoretical Peak Performance
How much power does a modern quad core Xeon have?
XEON “Clovertown” retires 4 floating point operations [FLOPS] per clock, so a 3GHz “Clovertown” should do 3 billion x 4 FLOPS x 4 cores per second, or 48 GFLOPS theoretical peak with 64-bit precision. That’s 48 billion double precision floating point operations per second.
List_of_Intel_Xeon_microprocessors
Intel Core Architecture

How does this relate in terms of our problem?
Assume much of our work is 3×3 float matrix multiply [MM]. One 3×3 MM is 3 multiplys and 2 adds for each of 9 elements = 27 multiplys and 18 adds, or 55 FLOPs. At peak, “Clovertown” should do 873 million 3×3 MM per second. The number of configurations I’d like to check per second is about 10,000. I should be able to devote 87,000 MM per configuration per second. Based on what I guess of our computational coasts, this seems like plenty of juice.
Even if I only had one core to use in a single thread, I would have 20,000 MM per configuration. This should put to rest our initial desire to use 8 different CPUs and take the added complexity of IPC.

What is the computational cost of our problem?
This question is something I have not been able to answer, nor have anyone answer adequately. I need to find some good docs preferably, or less ideally do some call stack tracing through the code.

Profiling Issues
Presently, profiling with VTune suggests that I have about eight computational domains each taking between 10 and 15 percent of CPU. This is not what one wants for an easy speed fix. I could optimize out any one of these domains and only see a 10% increase in speed.
At one point malloc/free was taking around 30%, and I put Google Perf Tools to work which brought it down to 10%. This was a very nice win. Only took a few days to do and required no change in architecture, just adding the right link line.

Cache Miss City
One of the possible problems that would really bring performance down is cache misses. The current architecture is base on dealing with single datums at a time rather than whole arrays of data. The genesis of these design decisions rests squarely with managements instance on a Model Driven Architectural way of doing things; also possibly from the developers wholesale embrace of object oriented mis-thinking.

Network Bandwidth
One datum at a time is what I suspect to be one of the pillars of our problem. Why is this so costly? Because we kill the cache. We give up both code and data locality. We guarantee that each datum requires at least two context switches.

Switching from one running process to another running process incurs a cost to the system. The values of all the registers must be saved in the present state, the status of all open files must be recorded and the present position in the program must be recorded. Then the contents of the MMU must be stored for the process (see next chapter). Then all those things must be read in for the next process, so that the state of the system is exactly as it was when the scheduler last interrupted the process. This is called a context switch. Context switching is a system overhead. It costs real time and CPU cycles, so we don’t want to context switch too often, or a lot of time will be wasted.

Using Netperf 2.4.2
Let’s do a little digging.

[andrew]$ ./netperf -a 4096 -t STREAM_STREAM

STREAM STREAM TEST

Recv   Send    Send

Socket Socket  Message  Elapsed

Size   Size    Size     Time     Throughput

bytes  bytes   bytes    secs.    10^6bits/sec110592  110592  110592    10.00    20369.26

[andrew]$ ./netperf  -t STREAM_STREAM

STREAM STREAM TEST

Recv   Send    Send

Socket Socket  Message  Elapsed

Size   Size    Size     Time     Throughput

bytes  bytes   bytes    secs.    10^6bits/sec

110592  110592  110592    10.00    16062.94

[andrew]$

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G -n 2  -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    GBytes  /s  %      %       us/KB   us/KB

110592  110592  110592    10.00      2.48   99.90  99.90   0.768   6445907.000

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G -n 2  -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    GBytes  /s  %      %       us/KB   us/KB

110592  110592  110592    10.00         2.48   99.90  99.90   0.768   6445907.000

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G -n 2  -t STREAM_STREAM -- -m 128 -M 128
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    GBytes  /s  %      %       us/KB   us/KB

110592  110592     128    10.00         0.02   99.25  99.25   105.632  886107072.000

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G   -t STREAM_RR
STREAM REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr

110592 110592 1       1      10.00   44428.70   80.85  80.91  36.397  933560.938
110592 110592

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G   -t STREAM_RR -- -r 100,100
STREAM REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr

110592 110592 100     100    10.00   45457.03   80.10  80.20  35.244  904538.250
110592 110592

bash-3.00$ ./netperf -a 4090 -A 4096 -c -C  -f G   -t STREAM_RR -- -r 100K,100K
STREAM REQUEST/RESPONSE TEST
Local /Remote
Socket Size   Request Resp.  Elapsed Trans.   CPU    CPU    S.dem   S.dem
Send   Recv   Size    Size   Time    Rate     local  remote local   remote
bytes  bytes  bytes   bytes  secs.   per sec  %      %      us/Tr   us/Tr

110592 110592 100     100    10.00   45404.84   80.20  80.35  35.328  907267.312
110592 110592

bash-3.00$ ./netperf -c -C -t STREAM_STREAM
STREAM STREAM TEST
Recv   Send    Send                          Utilization    Service Demand
Socket Socket  Message  Elapsed              Send   Recv    Send    Recv
Size   Size    Size     Time     Throughput  local  remote  local   remote
bytes  bytes   bytes    secs.    10^6bits/s  %      %       us/KB   us/KB

110592  110592  110592    10.00      16154.77   97.50  97.50   0.989   8294972.500

FLOPS for 3×3 Inversion

m3 = {{a11,a12,a13},{a21,a22,a23},{a31,a32,a33}}

{{a11,a12,a13},{a21,a22,a23},{a31,a32,a33}}

Inverse[m3]

{{(-a23 a32+a22 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(

        a13 a32-a12 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(-a13 a22+a12 a23)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33)},{(

        a23 a31-a21 a33)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(-a13 a31+a11 a33)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33),(

        a13 a21-a11 a23)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33)},{(-a22 a31+a21 a32)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33),(

        a12 a31-a11 a32)/(-a13 a22 a31+a12 a23 a31+a13 a21 a32-a11 a23 a32-

          a12 a21 a33+a11 a22 a33),(-a12 a21+a11 a22)/(-a13 a22 a31+

          a12 a23 a31+a13 a21 a32-a11 a23 a32-a12 a21 a33+a11 a22 a33)}}

The denominator is common taking 18 multiplys and 5 adds. The numerators take 18 multiplys and 9 adds. Then there are 9 divides for the elements. That is 36 mutiplys, 9 divides, and 5 adds for a total of 50 FLOPS for a 3×3 inversion.

OpenMP Presentation

Sunday, July 15th, 2007

 Well that went pretty well. A good crowd of 30 of our engineers and a couple folks from management checking in.

The end result is here:

OpenMP Powerpoint Presentation

Reminder to self - prepare a 1 hour presentation on OpenMP. Compare OpenMP to other solutions.

Notes:
(from Intel Thread Building Blocks Book) “For instance, Intel’s Math Kernel Library
(MKL) and Integrated Performance Primitives (IPP) library are implemented
internally using OpenMP. You can freely link a program using Threading Building
Blocks with the Intel MKL or Intel IPP library.?

Godaddy

Sunday, July 15th, 2007

I am working on a site for Friends of Modern Architecture in Lincoln, which is expected to be low traffic but informational to those who need it. Who to have host it? This was a question I asked a few years ago for this very site. It has to be inexpensive, and I prefer Linux. Godaddy has some pretty nice Linux options at less than $8 per month.

Can I get secure FTP on my godaddy server? Hmmm, not that I can see.

ASP.NET

Monday, May 14th, 2007

I need to show some ASP.NET example work. Ok, I have a proxy from apache to my XP box. And I have IIS set up. I can serve index.htm “HELLO”, why can I not get an asp page showing. No doubt I need to configure to serve ASP. Ok, not sure but maybe it was the reboot? Nope - ISS won’t serve an empty file apparently. I needed the “Good Morning” text to be visible. Now I need a simple ASP example.
ASPX is the file extension for an ASP.NET page. This is in fact the successor to ASP.

Expect another tool to learn

Wednesday, February 28th, 2007

I’m going over what I hope is a fool proof development install procedure for a large project. I need to include specific versions of gcc, and I think it might be nice to run the gcc tests, since all of this is automated anyway. Which brought me to expect. Sounds like I would do well to know this tool, and possibly it will be useful in the testing of the very project I’m working on.

While writing up the plan I have been assuming we would use svn. I really like svn, but it may be that git is better suited as one of the difficult constraints I have is supporting development on air gapped machines. One comparison note appears here. A long paper on Distributed (git) vs. Centralized (svn) development appears here.

Java const

Wednesday, February 21st, 2007

It’s been mentioned elsewhere, but I will mention it here. Java needs const; and it should work just like it does in C++. Some people will tell you that final suffices, but it does not.

Look into where I should complain.

David Tribble has a complete exposition on what const should look like in Java here.

Eclipse seems slow

Wednesday, November 15th, 2006

I’ve seen this claim before, that Eclipse is slow, but I had not experienced the issue quite so personally before. Most of my Eclipse development to date had been small components to other projects, or in one case a project that might have been about one thousand files. Now I am trying to turn a project that comprises some 40,000 files into an Eclipse project(s) and it is not going as I hoped. Subclipse takes tens of minutes in some operations, the Binary Search (CDT?) that gets started on Eclipse launch takes about 5 minutes. And for other reasons, the integrated gdb view does not seem to be working with the executable because of a time out during launch; though this could be a simple configuration problem. In any case, it looks like I’ll have to spend much more time breaking the project into small bits, which should be done anyway, I had just hoped to start building and debugging before I had to do house keeping.

I know that Subversion can handle the project size, and I know that make is suitable for the level of complexity involved in building everything, I just want to put it all in a nice integrated environment for those of us that would prefer not to use gdb from the command line, or who would like to use the refactoring tools in Eclipse rather than grep and sed!

VNC and CentOS

Tuesday, November 14th, 2006

I had some trouble getting VNC to connect to my CentOS box. The firewall prevented the connection.

The fix is simple enough:
[root]#/sbin/iptables -I RH-Firewall-1-INPUT -m state --state NEW -p tcp --destination-port 5901 -j ACCEPT

Saving the fix so it remains on boot seems a little more challenging; naturally I am mixed between the command line and graphical. I would like to know why my VNC connection seems to hang sometimes.

[root@sv ~]# apropos vnc
Xvnc (1) - the X VNC server
vncconfig (1) - configure and control a VNC server
vncpasswd (1) - change a VNC password
vncserver (1) - start or stop a VNC server
x0vncserver (1) - VNC server which continuously polls an X display

Prefer vncserver to Xvnc

Seems like the desktop from my vnc server gets hosed after a few months. I assume it is gnome, but I would think I get a new gnome session with each login. vncserver -kill :1 did the trick.

vncserver :1 -name G5 -depth 16 -geometry 1400x1000