Notable Hacks by 7B Software, Inc.


This page contains some of the interesting "hacks" that I've had the pleasure of executing during my career. I'll be fairly careful here to only include hacks that, so far as I know, nobody else has done.


The "login -r me" Unix hack. In the early 1990s, when Unix servers were gaining popularity (and prior to the vast popularity of the Web), I discovered a flaw in the login program shipped with a wide variety of Unix systems.

The hack would enable anybody to remotely log into the Unix machine using telnet. The flaw was due to the login program not being able to distinguish between an automatic remote login and a manual one. (In those days, automatic remote logins were supported by the -r option, for use by commands such as rsh. These days, we're a bit smarter and typically use the much stronger ssh.)

After I convinced myself that this flaw was sufficiently epidemic--by remotely logging into about twenty different Sun, Berkeley Unix, and AT&T Unix machines on the Net--I reported the flaw to Steve Bellovin, who quietly got all of the Unix manufacturers to fix it. I have never seen this bug published on any hacker sites. Nor do I know of any Unix machines that still have the flaw, so I feel pretty safe in publishing it.


The "stale stack" kernel hack.

I was once contracted to discover why a very large email server farm was unreliable. The symptom was that, at a rate of about once per month, a Unix machine used as an email forwarding box would crash. The crash was in the Unix kernel, and a stack trace was available.

There was no good way to reproduce the crash directly: no amount of load would make it happen more often than roughly once a month. Instead, we were forced to solve this problem using only the kernel stack and memory dump that was left behind after the crash.

This story would be pretty darn uninteresting if it were just about finding yet another kernel bug. Spelunking a kernel stack is a fairly obscure skill, but it certainly isn't particularly novel. Typically, you go looking for a pointer on the stack that has a weird value in it (like 0, or an ASCII value), figure out which variable in the kernel it correlates to, and backtrack into the code to find the spot where it was modified. From there, you can usually home in on the bug by enumerating all the possible modification scenarios; eventually you find a modification that's just not within spec, and you're done.
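To make the "weird value" test concrete, here is a minimal Python sketch of that kind of triage. It is only an illustration, not the tooling used at the time: the 32-bit word size, the little-endian byte order, and the kernel address range KERNEL_LO..KERNEL_HI are all assumptions, and classify_pointer is a hypothetical helper name.

```python
# Hypothetical address range for valid kernel pointers (an assumption
# made for this sketch; real ranges depend on the machine and kernel).
KERNEL_LO = 0xC0000000
KERNEL_HI = 0xFFFFFFFF


def looks_like_ascii(word, width=4):
    """Return True if every byte of the word is printable ASCII."""
    raw = word.to_bytes(width, "little")
    return all(0x20 <= b <= 0x7E for b in raw)


def classify_pointer(word):
    """Classify a word taken from a stack slot that should hold a pointer."""
    if word == 0:
        return "null"
    if looks_like_ascii(word):
        return "ascii"          # probably text that overwrote the pointer
    if not (KERNEL_LO <= word <= KERNEL_HI):
        return "out-of-range"
    return "ok"


# The bytes 'm', 'a', 'i', 'l' read as one little-endian 32-bit word.
mail_word = int.from_bytes(b"mail", "little")

for w in (0, mail_word, 0x00001000, 0xC0123456):
    print(hex(w), classify_pointer(w))
```

Any slot that comes back as something other than "ok" is a candidate for the backtracking step described above.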

This one was more difficult, though. After some analysis, I was able to prove that, at the time of the crash, the "bad" variable simply wasn't in the stack. I thought about this for a while, and realized that this was one of those really nasty bugs where the bad value was being set perhaps thousands of CPU cycles prior to the actual crash. And the upshot of this was that the stack didn't contain the bad variable value; it had long been overwritten by later executions.

So, what do you do when the only tool at your disposal doesn't appear to be enough to solve the problem?

In my case, I decided to try something I'd never heard anybody else trying. I'd look for evidence left behind by previous executions, in the hope that there was a smoking gun in there.

When you take a snapshot of kernel memory and have a look at the stack, you can see all of the modified stack variables from the stack of executing kernel subroutines. But, on that very same stack dump, there are also "left behind" modifications: values of variables of kernel subroutines that are not currently running or suspended. This happens simply because, when a subroutine returns, its stack variables aren't erased; instead, the stack frame is simply reset ("popped").
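The effect is easy to see in a toy model. The following Python sketch simulates a downward-growing stack as a plain list; SimStack and its methods are hypothetical names invented for this illustration, and a real kernel stack has return addresses, saved registers, and padding that are ignored here.

```python
class SimStack:
    """Toy model of a downward-growing stack (illustrative only)."""

    def __init__(self, size):
        self.mem = [0] * size
        self.sp = size              # sp == size means the stack is empty

    def push_frame(self, values):
        """Enter a subroutine: reserve slots and store its locals."""
        if len(values) > self.sp:
            raise OverflowError("simulated stack overflow")
        self.sp -= len(values)
        self.mem[self.sp:self.sp + len(values)] = values

    def pop_frame(self, nslots):
        """Return from a subroutine: only the stack pointer moves."""
        self.sp += nslots           # the memory itself is not erased

    def live(self):
        """Words belonging to running or suspended subroutines."""
        return self.mem[self.sp:]

    def stale(self):
        """Words below the stack pointer, left over from past calls."""
        return self.mem[:self.sp]


stack = SimStack(8)
stack.push_frame([1, 2])            # caller's locals
stack.push_frame([0xDEAD, 7])       # callee's locals
stack.pop_frame(2)                  # callee returns

print("live: ", stack.live())
print("stale:", stack.stale())
```

After the callee returns, its two values no longer appear in the live part of the stack, but they are still sitting in memory below the stack pointer, which is exactly the residue a memory dump preserves.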

By doing some careful analysis of how the C compiler allocates stack variables, and by limiting my search to a few variables that were likely to have been the source of the problem, I was able to find the "bad value" within the "stale" part of the stack. In other words, there was a value in the kernel stack dump that was out of spec--but the value wasn't for any of the variables in a currently executing or suspended routine. The value was for a variable in some subroutine that had finished executing some time in the past.
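A rough sketch of that search, in Python, might look like the following. Everything specific here is an assumption for illustration: the dump is modeled as a list of words, sp is the index of the current top of stack, candidate_offsets stands in for the slot positions worked out from the compiler's frame layout, and the in_spec predicate stands in for whatever "within spec" means for the variable in question.

```python
def find_out_of_spec(dump, sp, candidate_offsets, in_spec):
    """Check candidate slots in the stale region (indexes below sp).

    Returns a list of (index, value) pairs for slots whose value
    fails the in_spec predicate.
    """
    hits = []
    for off in candidate_offsets:
        idx = sp - off              # slots below sp belong to finished calls
        if 0 <= idx < sp and not in_spec(dump[idx]):
            hits.append((idx, dump[idx]))
    return hits


def is_kernel_pointer(word):
    """Hypothetical spec: the variable must be a kernel-range pointer."""
    return 0xC0000000 <= word <= 0xFFFFFFFF


# Toy dump: indexes 0..3 are stale, indexes 4..5 are live.
dump = [0, 0, 0xC0001000, 0x00000041, 0xC0002000, 0xC0003000]
sp = 4

print(find_out_of_spec(dump, sp, [1, 2], is_kernel_pointer))
```

Restricting the scan to a handful of candidate slots matters: the stale region is full of leftover values that look odd out of context, so checking every word against a generic rule would bury the one meaningful hit in noise.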

Once I had verified which variable was the bad one, I was able to attack the problem like most other kernel bugs. And eventually, I was able to find the bug. Deep, deep within the assembly code of the Unix scheduler, there was a mutual exclusion violation that would, about once a month, allow a network driver to run unsafely, overwriting the "bad" variable. The device driver interrupt would finish; then the suspended task would run, try to dereference the bad variable, and it's crash time.

I suppose that this particular bugfix owes more to sheer doggedness than anything else. Typically, a kernel hacker would have the ability to inject data into the system to provoke the bug, and so my "stale stack" technique just wouldn't be necessary. But I didn't have that option.


Member, Association
for Computing Machinery
Last updated: July 19, 2010