Saturday, September 2, 2017

More fun with AIX cfgncr_scsi

<in the previous part>... doesn't find SCSI disks. This one is tricky: it may be a problem with the interrupt routing, DMA, or the SCSI host emulation...

... or a bug in the AIX driver itself.

When AIX 4.2 tries to perform a SCSI inquiry, this is what shows up in the QEMU log:

(qemu) lsi_scsi: Write reg ??? ac = e4
lsi_scsi: Write reg ??? ad = 38
lsi_scsi: Write reg ??? ae = c4
lsi_scsi: Write reg ??? af = 81

The register at 0xac-0xaf is the DSA Relative Selector (DRS). QEMU knows about it, but does not seem to use it in any operation.

The newer LSI53c1010-66 manual says:

"This register supplies AD[63:32] during Table Indirect
Fetches and Load/Store Data Structure Address (DSA)
relative operations"

So, maybe just add support for this register to QEMU and allow 64-bit DMA transfers, right?

Wrong. The write to this register is the last write, and it doesn't start any SCSI command. Let's look at where it happens:

   0x018ff854:  stw     r8,-4(r7)
   0x018ff858:  li      r4,44         ; 0x2c
   0x018ff85c:  b       0x18fe348     ; p8xx_write_reg <= write happens here

Register r4 holds 0x2c, but the procedure writes to 0xac. Weird.

Let's look at the other registers:

0x018fe368 in ?? ()
(gdb) info registers r3 r5 r4
r3             0x18f5000        26169344
r5             0x31000080       822083712
r4             0x2c     44

What's that 80 at the end of r5? 0x80 + 0x2c is 0xac. Coincidence? Don't think so.
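The arithmetic behind that observation can be checked in a couple of lines. This is only a sketch: treating the low byte of r5 as the extra offset is the hunch from the paragraph above, not something the driver documents, and the variable names are made up for illustration.

```python
# Values copied from the gdb output above.
r5 = 0x31000080   # gdb: r5
r4 = 0x2c         # gdb: r4, the register offset the driver asks for

# Assumption: the low byte of r5 is the extra offset applied to the access.
shift = r5 & 0xff
effective = shift + r4

print(hex(shift))      # 0x80
print(hex(effective))  # 0xac, the offset that shows up in the QEMU log
```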

So, what happens here is that the driver tries to write to register 0x2c, but its base address is shifted by 0x80, so the write lands on 0xac. After some chasing I found where this shift comes from:

   0x018fdb9c:  bl      0x1909938
   0x018fdba0:  lwz     r2,20(r1)
   0x018fdba4:  cmpwi   cr1,r3,19
   0x018fdba8:  beq     cr1,0x18fdbc8
   0x018fdbac:  li      r8,1
   0x018fdbb0:  stb     r8,256(r28)
   0x018fdbb4:  li      r3,0
   0x018fdbb8:  bl      0x1909938
   0x018fdbbc:  lwz     r2,20(r1)
   0x018fdbc0:  lwz     r8,160(r28)
   0x018fdbc4:  b       0x18fdbd0
   0x018fdbc8:  stb     r29,256(r28)
   0x018fdbcc:  lwz     r8,160(r28)
   0x018fdbd0:  lis     r11,4096
   0x018fdbd4:  addic   r10,r8,128    ; this is the 0x80 I'm looking for
   0x018fdbd8:  li      r8,-1
   0x018fdbdc:  rlwinm  r31,r26,1,15,30
   0x018fdbe0:  addic   r23,r28,10580
   0x018fdbe4:  stw     r10,252(r28)
   0x018fdbe8:  li      r25,1
   0x018fdbec:  addi    r30,r28,0
   0x018fdbf0:  stwu    r25,10512(r30)
   0x018fdbf4:  stw     r8,10588(r28)
   0x018fdbf8:  lwz     r8,10500(r28)
   0x018fdbfc:  stw     r11,10520(r28)
   0x018fdc00:  stw     r10,10532(r28) ; and here it is stored
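The 0x80 in the marked `addic r10,r8,128` is an immediate baked into the instruction word. Below is a rough sketch of how that word is laid out, assuming the standard PowerPC D-form encoding with primary opcode 12 for `addic`. The helper names are hypothetical, and the zero-immediate variant is just one conceivable way of neutralising the add, not necessarily what was done in this post.

```python
def encode_addic(rt, ra, simm):
    """Encode 'addic rt,ra,simm' as a 32-bit PowerPC D-form word (opcode 12)."""
    return (12 << 26) | (rt << 21) | (ra << 16) | (simm & 0xffff)

def decode_dform(word):
    """Split a D-form word into (opcode, rt, ra, signed immediate)."""
    opcode = word >> 26
    rt = (word >> 21) & 0x1f
    ra = (word >> 16) & 0x1f
    simm = word & 0xffff
    if simm & 0x8000:          # sign-extend the 16-bit immediate
        simm -= 0x10000
    return opcode, rt, ra, simm

original = encode_addic(10, 8, 128)   # addic r10,r8,128
patched = encode_addic(10, 8, 0)      # hypothetical: same add with a zero offset

print(hex(original))           # 0x31480080
print(decode_dform(original))  # (12, 10, 8, 128)
print(hex(patched))            # 0x31480000
```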

It is added and stored unconditionally. If I drop this addic, something different happens:

(qemu) lsi_scsi: Write reg DSP0 2c = e4
lsi_scsi: Write reg DSP1 2d = 58
lsi_scsi: Write reg DSP2 2e = c4
lsi_scsi: Write reg DSP3 2f = 81
lsi_scsi: SCRIPTS dsp=81c458e4 opcode 41000000 arg 81c45a44
lsi_scsi: Selected target 0
lsi_scsi: SCRIPTS dsp=81c458ec opcode 78370000 arg 00000000
lsi_scsi: Read-Modify-Write reg 0x37 MOV data8=0x00 sfbr=0x00
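The four byte writes to 0x2c-0x2f in this log combine, least significant byte first, into the dsp value on the SCRIPTS line. A small sketch of that assembly follows; the function name is made up for illustration and this is not QEMU code.

```python
def assemble_register(writes, base):
    """Combine single-byte writes at base..base+3 into one 32-bit
    little-endian value. 'writes' is a list of (offset, byte) pairs."""
    value = 0
    for offset, byte in writes:
        value |= (byte & 0xff) << (8 * (offset - base))
    return value

# Byte writes copied from the log above (DSP0..DSP3).
dsp_writes = [(0x2c, 0xe4), (0x2d, 0x58), (0x2e, 0xc4), (0x2f, 0x81)]
dsp = assemble_register(dsp_writes, base=0x2c)

print(hex(dsp))  # 0x81c458e4, matching "SCRIPTS dsp=81c458e4"
```

The write to the top byte (DSP3) completes the address, which fits the log: the SCRIPTS line appears right after it.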

Why would it work on the physical hardware? My guess is that the register addresses are aliased there, so 0xac decodes to the same register as 0x2c. Pretty similar to the le bug in Solaris.
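To make the aliasing guess concrete, here is a toy model of the two decoding schemes. It assumes the real chip looks at only the low 7 bits of the register offset, which is the guess above and not something confirmed from a datasheet; the constant and function names are hypothetical.

```python
ALIAS_MASK = 0x7f   # assumption: real hardware decodes only 7 offset bits
FULL_MASK = 0xff    # an emulator that decodes all 8 bits sees 256 registers

def decode_aliased(offset):
    """Register index as the (assumed) aliasing hardware would see it."""
    return offset & ALIAS_MASK

def decode_full(offset):
    """Register index when the upper half is decoded as separate registers."""
    return offset & FULL_MASK

for off in (0xac, 0xad, 0xae, 0xaf):
    print(hex(off), "->", hex(decode_aliased(off)), "vs", hex(decode_full(off)))
```

Under that assumption the shifted writes to 0xac-0xaf fold back onto 0x2c-0x2f (DSP0..DSP3), while a model with the full 8-bit decode sends them to a different register.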

So, it's not that QEMU has some unimplemented registers. In this case it has too many implemented ones.

On the other hand, it still doesn't detect the SCSI disk, so maybe it has not just too many features, but too few as well...

/Stay tuned


Jason Stevens said...

whoa, totally awesome!

IBM always seems to do stuff in a weird way with the hardware.. Back in the day they gave me a version of AIX with my equipment (I was at a bank) and naturally it didn't work, I had to sit on the phone with them for hours debugging their install, and having some crazy fix in maintenance mode to tweak their registry thing to get it to work. I found that systems booted far more reliably in maintenance mode too.. Although I doubt that helps you in the slightest.

Still, awesome progress!

atar said...

Yeah, I've already realized that it's really, really hard to debug what's going on in AIX without phoning IBM.

In my case it's even more fun: I use the Motorola version of AIX. I guess the Motorola engineers had other ideas about the debug process than IBM. There are some additional boot options, for instance the "boot -s prompt" option, which I showed in the previous post. I found them using 'strings disk-dump.img'. The prompt does look interesting; for instance, it has a 'service mode' switch. Unfortunately I see no difference if I turn it on. So, at some point I'm going to need the AIX 4.2 install media for the Motorola machines.

Also, the Motorola guys left more debug info in their code. "boot -s verbose" is really verbose up to the point where the kernel starts. And after that it's pure IBM silence. The NCR driver is built with no debug information, except the function names. Some kernel functions are built even without the function names. I suspect two of them are mutex_lock and mutex_unlock, but it's really a shot in the dark.

So I've learned PPC assembly to get to this point, but to solve the current problem I have to learn the NCR/LSI SCRIPTS language.

Somehow it's ironic: ~8 years ago I started my journey into the QEMU world by debugging the SCSI inquiry command of a SPARC NCR HBA. And here I am again. :-)

Jason Stevens said...

SCSI really is too complicated!!! Sometimes I wonder if mainframe channel attached storage is simpler.