Making sense of (k)ksymoops output aka ''Oops decoding''
or
How to get to the offending line, even in a module
'a mini-HOWTO'
bert hubert <ahu@ds9a.nl>

If your kernel decides that something just happened which should not have, it may 'Oops', which indicates that this is a Bad Thing. With this Oops, it prints out a so called stack trace. This stack trace can tell you, if you manage to decode it, where the kernel was when things went wrong.

A raw stack trace is in the form of addresses via which the kernel arrived at the offending location. This looks something like this:
Oct 20 19:21:32 hubert kernel: Stack: c8831060 c7643280 c88330dd c7643280 c8838800 c78c6000 c77281e0 c1327e00 
Oct 20 19:21:32 hubert kernel: c88330f5 c8862c84 c8831acb c8838800 c8859fe0 c8862c40 00000042 00000001
Oct 20 19:21:32 hubert kernel: c885137c c8861280 c1327e00 000493e0 c78c6000 c8862520 c8862520 c12826c0
This is not much use as these numbers refer to addresses that are specific to your kernel.

So if you only see numbers in your trace, you first need to decode these with 'ksymoops'. This is not very hard but does require the System.map of your running kernel and, if you were running with modules, that you load the same modules, and in the same order as when the Oops happened. This may be easy if the Oops did not kill your machine but otherwise it can be a bit of a burden. With recent 2.5 kernels, this decoding has already been done for you:

kksymoops

As of Linux 2.5.35 or so, ksymoops functionality is built into the kernel, which is called 'kksymoops'. Be sure to turn the relevant kernel option (' Load all symbols for debugging/kksymoops', in the 'Kernel Hacking' menu) on when compiling. Your distribution will surely have done that for you. This means a slighly bigger kernel image but does not degrade performance in any way.

After decoding, you get a 'call trace' that looks like this:
Oct 20 19:21:32 hubert kernel: Call Trace:
Oct 20 19:21:32 hubert kernel: [<c8831060>] auth_domain_drop+0x50/0x60 [sunrpc]
Oct 20 19:21:32 hubert kernel: [<c88330dd>] cache_clean+0x16d/0x170 [sunrpc]
Oct 20 19:21:32 hubert kernel: [<c8838800>] auth_domain_cache+0x0/0x60 [sunrpc]
Oct 20 19:21:32 hubert kernel: [<c88330f5>] cache_flush+0x15/0x50 [sunrpc]
Oct 20 19:21:32 hubert kernel: [<c8862c84>] hash_sem+0x0/0x1c [nfsd]
Oct 20 19:21:32 hubert kernel: [<c8831acb>] svcauth_unix_purge+0x1b/0x20 [sunrpc]
Which is pretty nice already.  So far, we see that auth_domain_drop() caused the problem, and that it was called by cache_clean() which was called by auth_domain_cache() etc. Often this suffices to find out what went wrong, but sometimes we want more details yet.

Now to get the line number, the long way

The following is a different example, one that does not involve modules. This is why we suddenly trace 0xc01cf0d1, and not 'auth_domain_drop()'.

The canonical way to get from this dump to the line number is to recompile the entire kernel with debugging, which is done by adding '-g' to the toplevel Makefile CFLAGS, and then executing the following on the 'vmlinux' uncompressed kernel binary produced:
$ addr2line -e ./vmlinux 0xc01cf0d1
/mnt/linux-2.5.26/include/asm/bitops.h:244
or
$ gdb ./vmlinux
...
(gdb) l *0xc01cf0d1
0xc01cf0d1 is in read_chan (include/asm/bitops.h:244).
(...)
244 return ((1UL << (nr & 31)) & (((const volatile unsigned int *) addr)[nr >> 5])) != 0;
(...)
Note that this even works with raw numbers, without parsing through (k)symoops!

However, for several reasons this may not be what you want:
Because of these reasons, the output above is not from the stacktrace mentioned earlier. That oops occurred in a module and neither addr2line nor gdb can help us with that.

So, what can we do? Well, besides the raw address where the oops occurred, 0xc8831060 in this case, the decoded version is kind enough to provide us with a relative location of the problem. This is the second part of the top line above, 'auth_domain_drop+0x50/0x60'. By this, we are told that the offending address is 0x50 bytes from the start of the function auth_domain_drop() which is 0x60 bytes long in total.

The quick way

Now that we have a relative location, gdb can tell us where the exact problem is, provided that we help it a bit. As we are not going to recompile our entire kernel with debugging some work is required. Furthermore, we are working on a call trace from a module, even recompiling the entire kernel would not help as the error is in a module.

We start out with the closest sources we can get, perferably those from the same kernel version which caused the oops. The best way of  course is to use the exact tree from which your image was built. If you try to recreate a copy of a kernel that is already running from different sources, you must read the caveats section below. Not advised.

In our case, we know the problem is in the function auth_domain_drop(), which we now need to locate. There are several ways of doing this, a very blunt but effective one is:
$ find . -name "*.c" | xargs -n 500 grep auth_domain_drop    
./net/sunrpc/svcauth.c:void auth_domain_drop(struct cache_head *item, struct cache_detail *cd)
./net/sunrpc/svcauth.c: .cache_put = auth_domain_drop,
./net/sunrpc/svcauth.c: auth_domain_drop(&dom->h, &auth_domain_cache);
It will not always be this clear, but in our case the file to watch is obvious. Now we only need to compile svcauth.c with debugging, and let gdb loose on that. The easy way to do that is to rm svcauth.o and rerun 'make' like this:
$ EXTRA_CFLAGS=-g make ./net/sunrpc/svcauth.o
Now we are ready to run gdb:
$ gdb ./net/sunrpc/svcauth.o
(gdb) list *(auth_domain_drop+0x50)
0x1e0 is in auth_domain_drop (net/sunrpc/svcauth.c:147).
(...)
147 authtab[dom->flavour]->domain_release(dom);

And we have the offending line! Note how we asked a list of a relative address, auth_domain_drop+0x50, which is the top of our calltrace. I used the technique described above to report this exact oops and was rewarded with a patch within 6 hours, so it pays :-)

Important caveats

If you try to do this on a tree that is different from the one that is currently running, there is a great risk of producing bad results. If you feel that you really must do it, you need to make sure that you have:
When reporting line number derived this way, ALWAYS explain how you managed to get a line number. Kernel hackers will hate you and get really upset if you send them on a wild goose chase.

If things are wrong, it may look like this:
(gdb) list *(auth_domain_drop+0x50)
0x250 is in auth_domain_put (net/sunrpc/svcauth.c:160).
Which is decidedly bogus - suddenly our error has moved to another function, auth_domain_put(). This bad result was arrived at by recompiling the kernel with gcc 3.0.4 instead of gcc 3.2 that was used to compile the running kernel.

If you follow the method above on the source tree that was also used to build the kernel that produced the oops, you should be safe, but even then, check if you are recompiling using the same compiler that was used to build the kernel in the first place!

Furthermore, it IS theoretically possible for gcc to output different code when the -g flag is given, in which case too your results may be unreliable.

In that case, reload the module with debugging symbols, and see if the Oops address changes.