Showing posts with label debug. Show all posts
Showing posts with label debug. Show all posts

Sunday, September 24, 2017

What's the time?

Previously in this blog: ...two ways from here: either fix the floppy emulation, or make OFW for 40p with no floppy...

... or skip the call. You know, I have an armed debugger here and am not afraid to use it. So just turn the fatal call:


/usr/lib/methods/cfgfda_isa -2 -l fda0

into something harmless, like:

/bin/echo -2 -l fda0

by using

set *(int *) 0x200c11a8 = 0x2f62696e
set *(int *) 0x200c11ac = 0x2f656368
set *(int *) 0x200c11b0 = 0x6f000000

Well actually it probably should have been "/usr/bin/echo", there is no "/bin/echo" in the system. But obviously the attempt above was good enough for AIX, as it doesn't really need the floppy disk adapter (nor mouse & keyboard which I had to hack in a similar way at the second attempt). This brings AIX here:


Completed method for: fda0, Elapsed time = 0
Return code = 127
*** no stdout ****
***** stderr *****
sh: /usr/lib/methods/cfgfda_isa:  not found

Method error (/usr/lib/methods/cfgfda_isa -2 -l fda0 ):
        0514-068 Cause not known.
...
exec(/../usr/sbin/lqueryvg,-phdisk0,-L)
exec(/../usr/bin/grep,00000000000000000000000000000000)
exec(/usr/bin/dosread,-S,/preload,/preload)
exec(/usr/lpp/bosinst/datadaemon)
exec(/../usr/bin/sleep,1)

Where it hangs forever. And now the problem is sort of obvious. Yesterday I wrote that the boot log hadn't had shown any hint. But it did:


Time: 0 LEDS: 0x539
...
Time: 0 LEDS: 0x78a
...
Completed method for: bus0, Elapsed time = 0
...
Time: 0 LEDS: 0x539
...
Time: 0 LEDS: 0x868
...
Completed method for: scsi0, Elapsed time = 0
...

See? The clock is not ticking (it's probably caused by a QEMU bug, that "loadvm" command sometimes doesn't restore one of the machine timers. And I used the command a lot during the yesterdays session).

So basically there are two scenarios:
 - the clock is ticking - in this case AIX doesn't start any methods after spawning the init process
 - the clock is stopped - in this case it starts the methods up to the point where the timeouts are important. Probably if the clock had worked properly the boot process wouldn't had stopped at the floppy detection method.

Which means that debug process is getting real complicated. Now I have to debug the kernel scheduler, which is tricky. And obviously is different from AIX 4.2 which doesn't hang at that point.

The KDB from 5.1 has some features to see the scheduled timers, but I'm not sure it can be used to debug the interrupt handling. At least Solaris kadb was not good for debugging the interrupts, as it made a lot of side effects, and mostly hanged the system right after setting the breakpoint.

So, the good news: the most of the QEMU's 40p model devices are working properly. The bad news: finding a black sheep in a dark room is pretty hard.

Saturday, September 23, 2017

Some experiments with AIX 5.1

Since I could not find the AIX 4.2 install for Motorola, I gave AIX 5.1 under qemu-system-ppc a shot. The feelings are mixed, on one hand I've got no reference machine to check things, on the other hand the KDB debugger in AIX 5.1 is much more powerful than in 4.2. The initialization process of 5.1 is close to 4.2,  so I can recognize some structures. Which is good: the version 4.2 is quite different from 4.1.4 which I tried first. So I was afraid they made an equal leap in 4.x->5.x transition. Well, partially they did. Although the function names are more or less the same, the debugger made a great leap forward.

This stack trace looked like a flashback.

[01DE1BCC]init_pcicfg+000000 (2FF3A910 [??])
[01DE1380]config_pal+000030 (??)
[01DE12F8]config_planar_pal+0001D8 (??, ??)
[004832AC]config_kmod+000184 (??, ??, ??)
[004836E4]sysconfig+000104 (??, ??, ??)
[00003A94].sys_call+000000 ()
[10002668]cfgpal_rspc+0003E8 ()
[100016C0]main+000110 (??, ??, ??)
[10000188]__start+000088 ()

Under 4.2 it was like this:

(gdb) bt
#0  0x018d2b7c in ?? () -- pci_rw
#1  0x00088114 in ?? ()
#2  0x00088114 in ?? ()
#3  0x018d1db0 in ?? () -- init_crashdump 0x018d1d70
#4  0x018cf410 in ?? () -- config_pal  0x018cf35c
#5  0x018cf30c in ?? () -- config_planar_pal 0x018cf100
#6  0x000f9b3c in ?? () -- config_kmod 0x000f9a5c, 2 params, size 0x118
#7  0x000f9eb8 in ?? ()
#8  0x000037a8 in ?? ()

See the formatting differences and gaps? That's because under 4.2 I had to make the trace manually. Did I mention that the 4.2 debugger is eighties style? So, now I'm really enjoying the luxury of having a modern tool.

Also there is a possibility to make the output verbose:

KDB(0)> mw enter_dbg
enter_dbg+000000:  00000000  = 42
n_core+000000:  00000032  = .
KDB(0)>

But then there are also some bad news. There are still bugs (or missing features) in qemu. Even worse, there is at least one Heisenbug. Some times it gets to the PCI initialization and sometimes not. And in the cases where it doesn't get to PCI init it's really unclear why: it just sits in the idle loop, interrupts are enabled, and it receives the interrupts from the timer. Just for some reason it thinks there is nothing to do. Debugging such cases is a real nightmare.
So, I thought maybe go as far it can in case where it does reach PCI init and see any clues in the log.
No obvious clues, but here goes a pretty long log:

Time: 0 LEDS: 0x539
Number of running methods: 0
 cfgmgr LED{539}
----------------
Attempting to configure device 'bus0'
 cfgmgr LED{78A}
Time: 0 LEDS: 0x78a
Invoking /usr/lib/methods/cfgbus_pci -1 -l bus0
exec(/bin/sh,-c,/usr/lib/methods/cfgbus_pci -1 -l bus0)
Number of running methods: 1
exec(/usr/lib/methods/cfgbus_pci,-1,-l,bus0)
Breakpoint
.bus_register+000000     mflr    r0                  <01DEADA0>
KDB(0)> g
exec(/bin/sh,-c,/usr/lib/methods/define_rspc -c bus -s pci -t isa -p bus0 -w 88 -L 04-A0 -d)
exec(/usr/lib/methods/define_rspc,-c,bus,-s,pci,-t,isa,-p,bus0,-w,88,-L,04-A0,-d)
exec(/bin/sh,-c,/usr/lib/methods/cfgbus_isa -1 -l bus1)
exec(/usr/lib/methods/cfgbus_isa,-1,-l,bus1)
Breakpoint
.bus_register+000000     mflr    r0                  <01DF8FF0>
KDB(0)> g
exec(/bin/sh,-c,/usr/lib/methods/define_rspc -d -c adapter -s isa_sio -t fda -p bus1 -w PNP0700ffffffff -L 01-B0)
exec(/usr/lib/methods/define_rspc,-d,-c,adapter,-s,isa_sio,-t,fda,-p,bus1,-w,PNP0700ffffffff,-L,01-B0)
exec(/bin/sh,-c,/usr/lib/methods/define_rspc -d -c adapter -s isa_sio -t isa_keyboard -p bus1 -w PNP0303ffffffff -L 01-D0)
exec(/usr/lib/methods/define_rspc,-d,-c,adapter,-s,isa_sio,-t,isa_keyboard,-p,bus1,-w,PNP0303ffffffff,-L,01-D0)
exec(/bin/sh,-c,/usr/lib/methods/define_rspc -d -c adapter -s isa_sio -t isa_mouse -p bus1 -w PNP0F03ffffffff -L 01-E0)
exec(/usr/lib/methods/define_rspc,-d,-c,adapter,-s,isa_sio,-t,isa_mouse,-p,bus1,-w,PNP0F03ffffffff,-L,01-E0)
exec(/bin/sh,-c,/usr/lib/methods/define_rspc -d -c adapter -s isa_sio -t s1a -p bus1 -w PNP05011 -L 01-F0)
exec(/usr/lib/methods/define_rspc,-d,-c,adapter,-s,isa_sio,-t,s1a,-p,bus1,-w,PNP05011,-L,01-F0)
exec(/bin/sh,-c,/usr/lib/methods/define_rspc -c adapter -s pci -t ncr810 -p bus0 -w 96 -L 04-B0 -d)
exec(/usr/lib/methods/define_rspc,-c,adapter,-s,pci,-t,ncr810,-p,bus0,-w,96,-L,04-B0,-d)
----------------
Completed method for: bus0, Elapsed time = 0
Return code = 0
***** stdout *****
:devices.isa_sio.IBM000E :devices.isa_sio.PNP0400 :devices.pci.22100020
fda0,sioka0,sioma0,sa0,scsi0

*** no stderr ****
----------------
Time: 0 LEDS: 0x539
Number of running methods: 0
 cfgmgr LED{539}
----------------
Attempting to configure device 'fda0'
Method: /usr/lib/methods/cfgfda_isa not in boot image, configure in phase 2
----------------
Attempting to configure device 'sioka0'
Method: /usr/lib/methods/cfgkm_isa not in boot image, configure in phase 2
----------------
Attempting to configure device 'scsi0'
 cfgmgr LED{868}
Time: 0 LEDS: 0x868
Invoking /usr/lib/methods/cfgncr_scsi -1 -l scsi0
exec(/bin/sh,-c,/usr/lib/methods/cfgncr_scsi -1 -l scsi0)
exec(/usr/lib/methods/cfgncr_scsi,-1,-l,scsi0)
exec(/bin/sh,-c,/etc/methods/define -c disk -s scsi -t osdisk -p scsi0 -w 0,0)
exec(/etc/methods/define,-c,disk,-s,scsi,-t,osdisk,-p,scsi0,-w,0,0)
exec(/bin/sh,-c,/etc/methods/define -c cdrom -s scsi -t oscd -p scsi0 -w 2,0)
exec(/etc/methods/define,-c,cdrom,-s,scsi,-t,oscd,-p,scsi0,-w,2,0)
Number of running methods: 1
----------------
Completed method for: scsi0, Elapsed time = 0
Return code = 0
***** stdout *****
hdisk0 cd0
*** no stderr ****
----------------
Time: 0 LEDS: 0x539
Number of running methods: 0
 cfgmgr LED{539}
----------------
Attempting to configure device 'hdisk0'
Method: /etc/methods/cfgscdisk not in boot image, configure in phase 2
----------------
Attempting to configure device 'cd0'
 cfgmgr LED{723}
Time: 0 LEDS: 0x723
Invoking /etc/methods/cfgsccd -1 -l cd0
exec(/bin/sh,-c,/etc/methods/cfgsccd -1 -l cd0)
exec(/etc/methods/cfgsccd,-1,-l,cd0)
Number of running methods: 1
----------------
Completed method for: cd0, Elapsed time = 0
Return code = 0
*** no stdout ****
*** no stderr ****
----------------
Time: 0 LEDS: 0x539
Number of running methods: 0
 cfgmgr LED{539}
----------------
Time: 0 LEDS: 0x538
Invoking top level program -- "/usr/lib/methods/deflvm"
 cfgmgr LED{538}
exec(/bin/sh,-c,/usr/lib/methods/deflvm )
 cfgmgr LED{539}
Time: 0 LEDS: 0x539
Return code = 127
*** no stdout ****
***** stderr *****
sh: /usr/lib/methods/deflvm:  not found

Method error (/usr/lib/methods/deflvm):
        0514-068 Cause not known.
sh: /usr/lib/methods/deflvm:  not found

----------------
Time: 0 LEDS: 0x538
Invoking top level program -- "/usr/lib/methods/fdarcfgrule"
 cfgmgr LED{538}
exec(/bin/sh,-c,/usr/lib/methods/fdarcfgrule )
 cfgmgr LED{539}
Time: 0 LEDS: 0x539
Return code = 127
*** no stdout ****
***** stderr *****
sh: /usr/lib/methods/fdarcfgrule:  not found

Method error (/usr/lib/methods/fdarcfgrule):
        0514-068 Cause not known.
sh: /usr/lib/methods/fdarcfgrule:  not found

----------------
Time: 0 LEDS: 0x538
Invoking top level program -- "/usr/lib/methods/defssar"
 cfgmgr LED{538}
exec(/bin/sh,-c,/usr/lib/methods/defssar )
 cfgmgr LED{539}
Time: 0 LEDS: 0x539
Return code = 127
*** no stdout ****
***** stderr *****
sh: /usr/lib/methods/defssar:  not found

Method error (/usr/lib/methods/defssar):
        0514-068 Cause not known.
sh: /usr/lib/methods/defssar:  not found

 cfgmgr LED{FFF}
Configuration time: 0 seconds
+ 1> /etc/filesystems
+ /usr/lib/methods/showled 0x517
exec(/usr/lib/methods/showled,0x517)
 showled LED{517}
+ bootinfo -b
exec(/usr/sbin/bootinfo,-b)
exec(/usr/lib/boot/bin/bootinfo_rspc,-b)
+ mount -v cdrfs -o ro /dev/cd0 /SPOT
exec(/usr/sbin/mount,-v,cdrfs,-o,ro,/dev/cd0,/SPOT)
exec(/usr/bin/sh,-c,/usr/sbin/wlmcntrl -u -d "" > /dev/null 2>&1)
+ [ 0 -ne 0 ]
+ /usr/lib/methods/showled 0x512
exec(/usr/lib/methods/showled,0x512)
 showled LED{512}
+ /SPOT/usr/bin/rm -r /etc/init /usr/bin /usr/lib/boot /usr/lib/drivers/ataide /usr/lib/drivers/ataidepin /usr/lib/drivers/cfs.ext /usr/lib/drivers/idecdrom /usr/lib/drivers/idecdrompin /usr/lib/drivers/isa /usr/lib/drivers/pci /usr/lib/drivers/planar_pal_rspc /usr/lib/drivers/scdisk /usr/lib/drivers/scdiskpin /usr/lib/methods/cfgataide /usr/lib/methods/cfgbus_isa /usr/lib/methods/cfgbus_pci /usr/lib/methods/cfgidecdrom /usr/lib/methods/cfgncr_scsi /usr/lib/methods/cfgsccd /usr/lib/methods/cfgsys_rspc /usr/lib/methods/chggen /usr/lib/methods/chggen_rspc /usr/lib/methods/define /usr/lib/methods/define_rspc /usr/lib/methods/defsys /usr/lib/methods/showled /usr/lib/methods/ucfgdevice /usr/sbin
exec(/SPOT/usr/bin/rm,-r,/etc/init,/usr/bin,/usr/lib/boot,/usr/lib/drivers/ataide,/usr/lib/drivers/ataidepin,/usr/lib/drivers/cfs.ext,/usr/lib/drivers/idecdrom,/usr/lib/drivers/idecdrompin,/usr/lib/drivers/isa,/usr/lib/drivers/pci,/usr/lib/drivers/planar_pal_rspc,/usr/lib/drivers/scdisk,/usr/lib/drivers/scdiskpin,/usr/lib/methods/cfgataide,/usr/lib/methods/cfgbus_isa,/usr/lib/methods/cfgbus_pci,/usr/lib/methods/cfgidecdrom,/usr/lib/methods/cfgncr_scsi,/usr/lib/methods/cfgsccd,/usr/lib/methods/cfgsys_rspc,/usr/lib/methods/chggen,/usr/lib/methods/chggen_rspc,/usr/lib/methods/define,/usr/lib/methods/define_rspc,/usr/lib/methods/defsys,/usr/lib/methods/showled,/usr/lib/methods/ucfgdevice,/usr/sbin)
...
Attempting to configure device 'fda0'
 cfgmgr LED{828}
Time: 0 LEDS: 0x828
Invoking /usr/lib/methods/cfgfda_isa -2 -l fda0
exec(/bin/sh,-c,/usr/lib/methods/cfgfda_isa -2 -l fda0)
Number of running methods: 1
exec(/usr/lib/methods/cfgfda_isa,-2,-l,fda0)

Now it hangs on the floppy disk adapter init. Looks like there is no timeout. Strange.
There are two ways from here: either fix the floppy emulation, or make OFW for 40p with no floppy...

Sunday, July 23, 2017

Wiretapping AIX

Identified a couple of kernel and shared library functions, so I'm not poking in the dark anymore:

First of all I found execv. It gives a lot of insights about the AIX boot process. The process is quite different from Linux or Solaris boot. Kernel is small, and actually is already loaded, even under QEMU. The most other operating systems would write a greeting once a kernel is loaded. AIX does it all silently. On IBM machines there is a LED panel showing one byte of a status. On the Motorola there are just two LEDs which can light green or yellow, which altogether gives just 9 combinations. Not very informative. But even if I had one byte,  it still would not help. I look for error messages like "missing property", "unknown PCI chip", "missing residual data", etc.

The initialization of the PCI bus happens long after  the kernel spawns the /etc/init process.

Breakpoint 20, 0x0008cd38 in ?? ()
(gdb) x/s $r3
0x20051d08:     "/etc/methods/defsys"
(gdb) c
Continuing.
Breakpoint 20, 0x0008cd38 in ?? ()
(gdb) x/s $r3
0x2ff22090:     "/bin/sh"
(gdb) c
Continuing.
Breakpoint 20, 0x0008cd38 in ?? ()
(gdb) x/s $r3
0x20051d28:     "/usr/lib/methods/cfgsys_MOT3F00"       <= here is where it can't find the PCI bus

Then I found the printf and sprintf functions. Although AIX doesn't write anything on the screen, it still collects the boot log messages, so wiretapping  printf and fprintf helps to see them.

The house is still dark but now I have a search light. So whatever bugs are there, beware, you are going to be seen soon!

Saturday, July 22, 2017

Debugging AIX 4.2 boot

I wonder if it is possible to make the AIX 4.2 boot more verbose.
The various sources say that it should be done via
 
mw enter_dbg

under KDB. The AIX version I have doesn't have it. In fact it even doesn't have an option to disassemble a piece of code. Just the hardcore hex-dump, pretty much like it was in eighties.

That feeling when you started with a retro-computing and ended up with a steam punk computing.

ok  boot /scsi/disk@6 -s trap
Trap instruction interrupt.
> mw enter_dbg
032-001  You entered a command «mw» that is not valid.
> help
alter   … (a)lter — alter memory
back    … (b)ack — decrement the IAR
ditto   … «» — blank repeats the last command
break   … (br)eak — set a breakpoint
breaks  … (breaks) — list currently set breakpoints
buckets … (bu)ckets — display kmembucket structures
clear   … (c)lear — clear breakpoint(s)
display … (d)isplay — display a specified amount of memory
dmodsw  … (dm)odsw — display Streams dmodsw table
drivers … (dr)ivers — display device driver (devsw) table
find    … (f)ind — find a string in memory
float   … (fl)oat — display floating point registers
fmodsw  … (fm)odsw — display Streams fmodsw table
fs      … fs — display file system data structures
go      … (g)o — start executing the program
help    … (h)elp — display the list of valid commands
loop    … (l)oop — execute until control returns to this point
map     … (m)ap — display the system loadlist
mblk    … (mb)lk — display mblk/kmemstat structures
next    … (n)ext — increment the IAR
origin  … (o)rigin — set the origin
proc    … (p)roc — process table display
quit    … (q)uit — end the debugger session
queue   … (que)ue — display Streams queues
reset   … (r)eset — release a user defined variable
restore … (re)store — restore or do not restore the screen
screen  … (s)creen — display a screen containing registers and memory
set     … (se)t — define an/or set a variable
sregs   … (sr)egs — display segment registers
st      … (st) — store a full word into memory
stack   … (sta)ck — formatted stack trace
stc     … (stc) — store one byte into memory
step    … (ste)p — perform an instruction single-step
sth     … (sth) — store a half word into memory
stream  … (str)eam — display Stream head structures
swap    … (sw)ap — switch from the current display/keyboard to RS-232 port
thread  … (th)read — thread table display
trace   … (tr)ace — print traceback buffer
trb     … (trb) — display formatted timer request block info
tty     … (tt)y — Display tty struct
user    … (u)ser — formatted user area
uthread … (ut)hread — formatted uthread area
vars    … (v)ars — display a listing of the user_defined variables
vmm     … vmm — display virtual memory data structures
xlate   … (x)late — display the real address of a memory location
>

Saturday, August 6, 2016

Solaris 10 and year 2038 problem

Now I got a moment of a spare time to write why the Solaris 10 boot was failing under the new sun4v (sparc64) emulation target for QEMU.

It turned out that the now solved SMF issues I mentioned before were caused by a single character typo.

Stepping through the SQLite code I’ve noticed that there are two schemes: one persistent, which to my surprise has been opened with no problems, and a temporary one which failed because it could not create a file under /etc/svc/volatile which resides in RAM.

Why? Because of a very funny reason. The old Solaris versions used to check whether Real Time Clock (sometimes they call it “rtc”, sometimes they call it tod) returned a sane value and ignored it if it's not.

Solaris 10 issues a warning, but goes on and uses the given time. Then init system call creating file on a UFS considers time after 0x7fffffff invalid, which sends SMF into busy error loop.

The fatal typo was writing “qemu_clock_get_ns” instead of “qemu_clock_get_ms”, so I hit the error which the rest of the mankind using Solaris 10 for OpenSPARC T1 will hit 22 years later.

So let’s wait and see how many people will find my blog entries about SMF in February 2038.


Sunday, February 28, 2016

What do SQL and SPARCv9 assembly language have in common?

Well, here we go: I’m debugging SQL execution switching between the kmdb kernel debugger and gdb.

Breakpoint 70, 0x000000000003e528 in sqliteInitOne ()
0x000000000003ec9c in sqlite_exec ()
(gdb) x $i1
0xadea8:        "SELECT type, name, rootpage, sql, 0 FROM \"main\".sqlite_master"

SMF uses sqlite, so the boot process involves some SQLs.
Who would think that 20 years ago?

But it’s fun indeed. Booting Solaris/sparc under sun4v not just involves plain repetition of the old exercises, but requires some totally new ones as well.

Saturday, February 27, 2016

Dial 1-555-MY-SMF

The boot process of the Solaris 2.5 – Solaris 9 is quite robust. If init for some reason fails, there is always a chance to add “-b” boot option and try to debug it manually.

I think the old generation of the Sun engineers implemented it just to make debugging on the real world hardware easier. I really appreciated this option 6 years ago as I was making Solaris/sparc under qemu possible.

Nowadays at the early stages they probably do the most of debugging in simulators.

This would explain why boot process debugging became much harder after introducing SMF in Solaris 10.

Particularly I’m hitting the following crash, happening multiple times pro second in an endless loop:

cpu0: UltraSPARC-T1 (cpuid 0 clock 5 MHz)
iscsi0 at root
iscsi0 is /iscsi

INIT: Executing svc.startd

svc.configd: smf(5) database integrity check of:

    /etc/svc/repository.db

  failed. The database might be damaged or a media error might have
  prevented it from being verified.  Additional information useful to
  your service provider is in:

    /etc/svc/volatile/db_errors

  The system will not be able to boot until you have restored a working
  database.  svc.startd(1M) will provide a sulogin(1M) prompt for recovery
  purposes.  The command:

    /lib/svc/bin/restore_repository

  can be run to restore a backup version of your repository.  See
  http://sun.com/msg/SMF-8000-MY for more information.

Requesting System Maintenance Mode
(See /lib/svc/share/README for more information.)
svc.configd exited with status 102 (database initialization failure)



On the other hand, now I can use the source of OpenSolaris and step through it in gdb. Different epoch different debug methods.

Saturday, February 20, 2016

Bad, bad cafe! (0xbaddcafe)

Debugging Solaris 10 boot I saw something interesting in an exception trace:

143368: Unaligned Memory Access (v=0034)
pc: 00000000f02421f8  npc: 00000000f02421fc
%g0-3: 0000000000000000 0000000000000001 0000000000000000 00000000edd00620
%g4-7: baddcafebaddcafe 0000000000002e7f 0000000000000000 00000000f0243de8 
%o0-3: 00000000018d46e0 0000000000000001 00000000ede8e7e1 0000000001213010

And indeed, this is not a random pattern. It's a helping hand from the great, wise Solaris engineers who cared to help the ancestors in finding problems with hardware and kernel modules:

opensolaris/usr/src/uts/common/sys/kmem_impl.h:
#define  KMEM_UNINITIALIZED_PATTERN      0xbaddcafebaddcafeULL

Looking at the OpenSolaris sources and Solaris documentation, there are more such helping patterns:

Uninitialized Data: 0xbaddcafe
Redzone: 0xfeedface
Freed Buffer Checking: 0xdeadbeef

They are described in the "Detecting Memory Corruption" chapter of Solaris Modular Debugger Guide, but did actually appear long before mdb.

Saturday, February 6, 2016

Yo dawg, I heard you like debugging



Here is the story: my sun4v can boot OBP, but booting Solaris 10 hangs with no error messages. Ok, being there, done that. Let’s start the Solaris kernel with a debugger. I really liked kadb for debugging early boot stuff, but the Solaris 10 image supplied with the OpenSPARC project has only its successor - kmdb.  Well, kmdb is indeed more advanced, but it’s also quite bigger than its predecessor.  Which might be (or might be not) the reason for it failing to boot:

Sun Fire T2000, No Keyboard
Copyright 2005 Sun Microsystems, Inc.  All rights reserved.
OpenBoot 4.20.0, 256 MB memory available, Serial #1122867.
[mo23723 obp4.20.0 #0]
Ethernet address 0:80:3:de:ad:3, Host ID: 80112233.
ok boot -kdv
Boot device: /virtual-devices/disk@0  File and args: -kdv
Loading ufs-file-system package 1.4 04 Aug 1995 13:02:54.
FCode UFS Reader 1.12 00/07/17 15:48:16.
Loading: /platform/SUNW,Sun-Fire-T2000/ufsboot
Loading: /platform/sun4v/ufsboot
The boot filesystem is logging.
The ufs log is empty and will not be used.
Size: 0x76e40+0x1c872+0x3123a Bytes
module /platform/sun4v/kernel/sparcv9/unix: text at [0x1000000, 0x1076e3f] data at 0x1800000
module misc/sparcv9/krtld: text at [0x1076e40, 0x108f737] data at 0x184dab0
module /platform/sun4v/kernel/sparcv9/genunix: text at [0x108f738, 0x11dd437] data at 0x18531c0
module /platform/sun4v/kernel/misc/sparcv9/platmod: text at [0x11dd438, 0x11dd43f] data at 0x18a4be0
module /platform/sun4v/kernel/cpu/sparcv9/SUNW,UltraSPARC-T1: text at [0x11dd440, 0x11e06ff] data at 0x18a5300
Loading kmdb...
module /platform/sun4v/kernel/misc/sparcv9/kmdbmod: text at [0x11e0700, 0x124b2bf] data at 0x18b4da0
module /kernel/misc/sparcv9/ctf: text at [0x124b2c0, 0x1252d97] data at 0x18d6ed0
module /kernel/misc/sparcv9/zmod: text at [0x1252d98, 0x1257a67] data at 0x18d7af8
failed to decompress CTF data for unix: File data structure corruption detected
failed to decompress CTF data for genunix: String name offset is corrupt
failed to decompress CTF data for ctf: File data structure corruption detected
failed to decompress CTF data for zmod: File data structure corruption detected


What is the solution? Connect another debugger (gdb) to QEMU and debug the Solaris debugger (kmdb). Sounds reasonable, right?  In the next step I found a place where memory is already corrupted. This has been easy: as you see, the Solaris engineers put some sanity checks in the CTF code. Well done, Sun guys!
Finding the place where it gets corrupted is a bit harder: gdb has no watch-points on the physical memory, supporting only virtual memory watch-points. The solution is indeed starting the QEMU process itself in a debugger. At this point it gets slightly insane:

I put a debugger (kmdb) in a debugger (gdb x86-64) and connected it to a debugger (gdb sparc-v9) so I can debug while I’m debugging a debugger.

Sunday, November 22, 2009

Hidden OBP feature found

debugging the initial Power-On-Self-Test of OBP 2.29 I found a secret level a cool undocumented feature, PromDiag. Whenever I turn it on, instead of getting a usual OBP "OK" prompt I get:

PromDiag
NOK>

I wonder what is "NOK"? Does it mean "Not OK"? Anyway, I played with it a little. It runed out that it can launch single POST tests, and there are some more features, which have to be discovered yet. All in all it accepts just a few symbols: numbers, dot, comma, c, h, l, q, r, s: