Late Night At Blenz Coffee

I'm on the hunt for an elusive bug.

Michael David Crawford
January 29, 2007

Copyright © 2007 Michael David Crawford. All Rights Reserved.

It's twenty minutes till one on a Sunday night at the Blenz Coffee at Robson and Burrard in Vancouver. I have been awake less than two hours; I've been sleeping very irregularly lately as I have been working some long days searching for the solution to a critical bug in my company's software.

Wednesday and Thursday I worked a thirty-hour day, then went home and slept another thirty hours.

There aren't many people in here, just a few night owls.

In a few minutes, after I post this, I'll walk to my office in Gastown and resume my bug hunt. It's very important that I fix it by morning.


I'm on my third project with the company. I did well on my first project, so they had great hope for me for the second, but I'm afraid I was completely flummoxed by it. I should have told them I was having trouble, I should have asked for help, but I didn't want to admit defeat, I didn't want to disappoint them, and so I never asked for help, and as a result I put the whole project behind schedule.

Yes, that's right, I fucked up a software project. I will gladly eat crow.

My project manager went on vacation. The day before he returned, another project manager asked to speak to me privately, and asked if I was having trouble, and it all came pouring out: I confessed that I did not have a clue.

He asked what I might be better at, and I said "You know what I'm really good at? Debugging. I'm one of the best in the business at debugging."

The next morning, he asked if I'd like to fix a bug on a different project, and I said that I would gladly.

I had a chance to redeem myself!

The problem is that the bug could be in a driver my teammates wrote, in some of our userspace software, in the Mac OS X kernel, in one of Apple's drivers, or in some of OS X's userspace software. That's a lot of territory to hunt through - the bug could be anywhere.

I'm pretty sure now that I know the immediate cause of the failure. I have not yet figured out how it gets to be that way. The bug happens every time in one machine configuration, and never happens in another. I don't know why.

I am set up to use the GNU gdb two-machine debugger to debug the OS X kernel and our driver. If you build the kernel from source, and give gdb the path to a copy of the kernel that has debugging symbols, you can do source code debugging of the kernel. It's quite luxurious after all the years I did MacsBug assembly debugging of the Classic Mac.

(The OS X kernel and many other components of OS X are Open Source.)

The gdb user interface runs on one machine, connected via Ethernet to a debugger stub on the machine under test. You can configure the test machine to drop into the debugger at boot, so you can set breakpoints, or you can configure the power button to generate a non-maskable interrupt so you can drop into the debugger at any time.

It turned out that the instructions for compiling the Intel OS X kernel had been taken offline for some reason. I was able to piece them together using Google and searching mailing list archives.

They're playing some nice music at the cafe tonight... I saw your face, in a crowded place, and I don't know what to do, cause I'll never be with you.

OS X has quite a nice driver architecture. The OS X I/O Kit is written in C++, as are the drivers. All the drivers are dynamically loaded.

The kernel is built on top of Mach, but not as a microkernel: Mach is statically linked into the kernel along with everything else, so it all runs in one address space. That avoids the performance problems caused by frequent context switches, while still allowing use of Mach's features such as messaging. Userspace code and drivers communicate with each other via Mach messaging.

I don't want to say exactly why it's important to my company that I fix this bug, and fix it by morning, but it's important. To get it fixed, in time, would make a real difference to all of us where I work.

It's time to go to fix my bug now.
