Sunday, February 28, 2016

What do SQL and SPARCv9 assembly language have in common?

Well, here we go: I’m debugging SQL execution switching between the kmdb kernel debugger and gdb.

Breakpoint 70, 0x000000000003e528 in sqliteInitOne ()
0x000000000003ec9c in sqlite_exec ()
(gdb) x $i1
0xadea8:        "SELECT type, name, rootpage, sql, 0 FROM \"main\".sqlite_master"

SMF uses sqlite, so the boot process involves some SQLs.
Who would think that 20 years ago?

But it’s fun indeed. Booting Solaris/sparc under sun4v not just involves plain repetition of the old exercises, but requires some totally new ones as well.

Saturday, February 27, 2016

Dial 1-555-MY-SMF

The boot process of the Solaris 2.5 – Solaris 9 is quite robust. If init for some reason fails, there is always a chance to add “-b” boot option and try to debug it manually.

I think the old generation of the Sun engineers implemented it just to make debugging on the real world hardware easier. I really appreciated this option 6 years ago as I was making Solaris/sparc under qemu possible.

Nowadays at the early stages they probably do the most of debugging in simulators.

This would explain why boot process debugging became much harder after introducing SMF in Solaris 10.

Particularly I’m hitting the following crash, happening multiple times pro second in an endless loop:

cpu0: UltraSPARC-T1 (cpuid 0 clock 5 MHz)
iscsi0 at root
iscsi0 is /iscsi

INIT: Executing svc.startd

svc.configd: smf(5) database integrity check of:

    /etc/svc/repository.db

  failed. The database might be damaged or a media error might have
  prevented it from being verified.  Additional information useful to
  your service provider is in:

    /etc/svc/volatile/db_errors

  The system will not be able to boot until you have restored a working
  database.  svc.startd(1M) will provide a sulogin(1M) prompt for recovery
  purposes.  The command:

    /lib/svc/bin/restore_repository

  can be run to restore a backup version of your repository.  See
  http://sun.com/msg/SMF-8000-MY for more information.

Requesting System Maintenance Mode
(See /lib/svc/share/README for more information.)
svc.configd exited with status 102 (database initialization failure)



On the other hand, now I can use the source of OpenSolaris and step through it in gdb. Different epoch different debug methods.

Saturday, February 20, 2016

Bad, bad cafe! (0xbaddcafe)

Debugging Solaris 10 boot I saw something interesting in an exception trace:

143368: Unaligned Memory Access (v=0034)
pc: 00000000f02421f8  npc: 00000000f02421fc
%g0-3: 0000000000000000 0000000000000001 0000000000000000 00000000edd00620
%g4-7: baddcafebaddcafe 0000000000002e7f 0000000000000000 00000000f0243de8 
%o0-3: 00000000018d46e0 0000000000000001 00000000ede8e7e1 0000000001213010

And indeed, this is not a random pattern. It's a helping hand from the great, wise Solaris engineers who cared to help the ancestors in finding problems with hardware and kernel modules:

opensolaris/usr/src/uts/common/sys/kmem_impl.h:
#define  KMEM_UNINITIALIZED_PATTERN      0xbaddcafebaddcafeULL

Looking at the OpenSolaris sources and Solaris documentation, there are more such helping patterns:

Uninitialized Data: 0xbaddcafe
Redzone: 0xfeedface
Freed Buffer Checking: 0xdeadbeef

They are described in the "Detecting Memory Corruption" chapter of Solaris Modular Debugger Guide, but did actually appear long before mdb.

Saturday, February 6, 2016

Yo dawg, I heard you like debugging



Here is the story: my sun4v can boot OBP, but booting Solaris 10 hangs with no error messages. Ok, being there, done that. Let’s start the Solaris kernel with a debugger. I really liked kadb for debugging early boot stuff, but the Solaris 10 image supplied with the OpenSPARC project has only its successor - kmdb.  Well, kmdb is indeed more advanced, but it’s also quite bigger than its predecessor.  Which might be (or might be not) the reason for it failing to boot:

Sun Fire T2000, No Keyboard
Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.20.0, 256 MB memory available, Serial #1122867.
[mo23723 obp4.20.0 #0]
Ethernet address 0:80:3:de:ad:3, Host ID: 80112233.
ok boot -kdv
Boot device: /virtual-devices/disk@0  File and args: -kdv
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-T2000/ufsboot
Loading: /platform/sun4v/ufsboot
The boot filesystem is logging.
The ufs log is empty and will not be used.
Size: 0x76e40+0x1c872+0x3123a Bytes
module /platform/sun4v/kernel/sparcv9/unix: text at [0x1000000, 0x1076e3f] data at 0x1800000
module misc/sparcv9/krtld: text at [0x1076e40, 0x108f737] data at 0x184dab0
module /platform/sun4v/kernel/sparcv9/genunix: text at [0x108f738, 0x11dd437] data at 0x18531c0
module /platform/sun4v/kernel/misc/sparcv9/platmod: text at [0x11dd438, 0x11dd43f] data at 0x18a4be0
module /platform/sun4v/kernel/cpu/sparcv9/SUNW,UltraSPARC-T1: text at [0x11dd440, 0x11e06ff] data at 0x18a5300
Loading kmdb...
module /platform/sun4v/kernel/misc/sparcv9/kmdbmod: text at [0x11e0700, 0x124b2bf] data at 0x18b4da0
module /kernel/misc/sparcv9/ctf: text at [0x124b2c0, 0x1252d97] data at 0x18d6ed0
module /kernel/misc/sparcv9/zmod: text at [0x1252d98, 0x1257a67] data at 0x18d7af8
failed to decompress CTF data for unix: File data structure corruption detected
failed to decompress CTF data for genunix: String name offset is corrupt
failed to decompress CTF data for ctf: File data structure corruption detected
failed to decompress CTF data for zmod: File data structure corruption detected


What is the solution? Connect another debugger (gdb) to QEMU and debug the Solaris debugger (kmdb). Sounds reasonable, right?  In the next step I found a place where memory is already corrupted. This has been easy: as you see, the Solaris engineers put some sanity checks in the CTF code. Well done, Sun guys!
Finding the place where it gets corrupted is a bit harder: gdb has no watch-points on the physical memory, supporting only virtual memory watch-points. The solution is indeed starting the QEMU process itself in a debugger. At this point it gets slightly insane:

I put a debugger (kmdb) in a debugger (gdb x86-64) and connected it to a debugger (gdb sparc-v9) so I can debug while I’m debugging a debugger.