Showing posts with label bug. Show all posts

Saturday, September 2, 2017

More fun with AIX cfgncr_scsi

<in the previous part>... doesn't find SCSI disks. Here it is tricky, it may be a problem with the interrupt routing, or DMA or SCSI host emulation...

... or a bug in the AIX driver itself.

When AIX 4.2 tries to perform a SCSI inquiry, this is what shows up in the QEMU log:

(qemu) lsi_scsi: Write reg ??? ac = e4
lsi_scsi: Write reg ??? ad = 38
lsi_scsi: Write reg ??? ae = c4
lsi_scsi: Write reg ??? af = 81

The register at 0xac-0xaf is the DSA Relative Selector (DRS). It is known to QEMU, but doesn't seem to be used in any operation.

The newer LSI53c1010-66 manual says:

"This register supplies AD[63:32] during Table Indirect
Fetches and Load/Store Data Structure Address (DSA)
relative operations"

So, maybe just add the support of this register to QEMU and allow the 64 bit DMA transfers, right?

Wrong. The write to this register is the last write and it doesn't start any SCSI command. Let's look where it happens:

p8xx_start_chip:
...
   0x018ff854:  stw     r8,-4(r7)
   0x018ff858:  li      r4,44         ; 0x2c
   0x018ff85c:  b       0x18fe348     ; p8xx_write_reg <= write happens here

The register r4 is 0x2c, but the procedure writes to 0xac. Weird.

Let's look at the other registers:

0x018fe368 in ?? ()
(gdb) info registers r3 r5 r4
r3             0x18f5000        26169344
r5             0x31000080       822083712
r4             0x2c     44

What's that 80 at the end of r5? 0x80 + 0x2c is 0xac. Coincidence? Don't think so.

So, what happens here is the driver tries to write 0x2c, but the bus is shifted, so it hits 0xac. After some chasing I found where this shift is coming from:

p8xx_config:
...
   0x018fdb9c:  bl      0x1909938
   0x018fdba0:  lwz     r2,20(r1)
   0x018fdba4:  cmpwi   cr1,r3,19
   0x018fdba8:  beq     cr1,0x18fdbc8
   0x018fdbac:  li      r8,1
   0x018fdbb0:  stb     r8,256(r28)
   0x018fdbb4:  li      r3,0
   0x018fdbb8:  bl      0x1909938
   0x018fdbbc:  lwz     r2,20(r1)
   0x018fdbc0:  lwz     r8,160(r28)
   0x018fdbc4:  b       0x18fdbd0
   0x018fdbc8:  stb     r29,256(r28)
   0x018fdbcc:  lwz     r8,160(r28)
   0x018fdbd0:  lis     r11,4096
   0x018fdbd4:  addic   r10,r8,128    ; this is the 0x80 I'm looking for
   0x018fdbd8:  li      r8,-1
   0x018fdbdc:  rlwinm  r31,r26,1,15,30
   0x018fdbe0:  addic   r23,r28,10580
   0x018fdbe4:  stw     r10,252(r28)
   0x018fdbe8:  li      r25,1
   0x018fdbec:  addi    r30,r28,0
   0x018fdbf0:  stwu    r25,10512(r30)
   0x018fdbf4:  stw     r8,10588(r28)
   0x018fdbf8:  lwz     r8,10500(r28)
   0x018fdbfc:  stw     r11,10520(r28)
   0x018fdc00:  stw     r10,10532(r28) ; and here it is stored
...

It is added and stored unconditionally. If I drop this addic, something different happens:

(qemu) lsi_scsi: Write reg DSP0 2c = e4
lsi_scsi: Write reg DSP1 2d = 58
lsi_scsi: Write reg DSP2 2e = c4
lsi_scsi: Write reg DSP3 2f = 81
lsi_scsi: SCRIPTS dsp=81c458e4 opcode 41000000 arg 81c45a44
lsi_scsi: Selected target 0
lsi_scsi: SCRIPTS dsp=81c458ec opcode 78370000 arg 00000000
lsi_scsi: Read-Modify-Write reg 0x37 MOV data8=0x00 sfbr=0x00
...
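
To read the log above: the four byte writes to 0x2c-0x2f form one little-endian 32-bit value, which is the address that shows up as dsp= on the first SCRIPTS line. A minimal sketch of that assembly (the helper name assemble_le is made up for illustration and is not part of QEMU):

```python
def assemble_le(byte_writes):
    """Combine (register, value) byte writes into one little-endian word."""
    base = min(reg for reg, _ in byte_writes)
    word = 0
    for reg, value in byte_writes:
        # The lowest register holds the least significant byte.
        word |= (value & 0xFF) << (8 * (reg - base))
    return word


# Byte writes copied from the log: DSP0..DSP3 at 0x2c..0x2f.
dsp_writes = [(0x2C, 0xE4), (0x2D, 0x58), (0x2E, 0xC4), (0x2F, 0x81)]
dsp = assemble_le(dsp_writes)
print(f"dsp={dsp:08x}")  # prints dsp=81c458e4, as in the SCRIPTS line
```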

Why would it work on the physical hardware? I guess because the addresses are aliased. Pretty similar to the le bug in Solaris.
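
A rough sketch of the aliasing guess, assuming (this is only my assumption, not something taken from the chip manual) that the real chip decodes just the low 7 bits of the register offset, while an emulator that decodes the full byte sees a different register:

```python
REG_WINDOW = 0x80  # assumed size of the aliased register window


def decode_with_aliasing(offset):
    """Hypothetical hardware decode: the upper offset bit is ignored."""
    return offset % REG_WINDOW


def decode_full(offset):
    """Decode of the full byte offset, no aliasing."""
    return offset & 0xFF


offset = 0x80 + 0x2C  # what the driver actually writes to
print(hex(decode_with_aliasing(offset)))  # 0x2c, the DSP register
print(hex(decode_full(offset)))           # 0xac, the DRS register
```

Under that assumption the driver's extra 0x80 is harmless on real silicon and fatal only where 0xac is a register of its own.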

So, it's not that QEMU has some unimplemented registers. In this case it has too many implemented ones.

On the other hand, it still doesn't detect the SCSI disk, so maybe it has not just too many features, but too few as well...

/Stay tuned

Sunday, May 6, 2012

Qemu is going to boot Linux/sparc64

After considering it a bit, I thought it's a good marketing strategy: the free QEMU version shall run the free OS - Linux. And if anyone is interested in running something else, feel free to ask me for [paid] support. :-)

So, Linux is going to be the second OS that vanilla qemu-system-sparc64 boots - HelenOS was the first one. But, unlike HelenOS, Linux will be fully functional, having not just a stdout, but a stdin as well. And serial port support!

At the moment OpenBIOS has some missing features - it doesn't describe the interrupt mappings - and a regression - currently it can't even boot HelenOS from a command line. But both are not show-stoppers.

Once my patches are accepted I'll publish the OpenBIOS command to boot Linux (not because it's top secret, but due to the dependency on a certain version).

Sunday, April 22, 2012

Sent y2k10 fix upstream

It occurred to me that I've described the fix for y2k10 bug in qemu-system-sparc which makes Solaris 2.5.1 (and prior) hang, but never managed to submit it upstream.

So, let's fix it. For those one and a half users who care ;-) .

Sunday, January 22, 2012

Does anybody use recent qemu versions?

After a long break I tried a current qemu git/master today. What can I say? The SPARC-related stuff has been broken in a few places for a few months. SCSI signals an error for the "Inquiry" command, the Leon3 test hangs, sparc64 is broken too. Bisecting is tricky though, because at least sparc64 seems to be broken in different ways in subsequent commits.

Don't have time to fix it myself, but will send a couple of bug reports to the mailing list. Meanwhile, what is your favorite qemu version?

Update: meanwhile (7/Feb/2012) SCSI and Leon3 issues are fixed, thanks to Thomas Higdon and Fabien Chouteau.

Saturday, July 2, 2011

More math bugs

Why would one file system work, and another one not?

Because the fs driver is buggy? Of course not! Those who read this blog on a regular basis know that the inability to read a file has to do with a bug in the math emulation.
Thanks to Jakub's test case, I fixed two bugs with carry flag handling (yes, again). Now HelenOS/sparc64 boot looks like this:


My impressions of HelenOS - it's neat, the sources are well documented and easy to read. Also I needed just a few minutes to set up cross compiling under Linux/x86_64. So, if you need a small, micro-kernel(!) and multi-arch (amd64, arm32, ia32, ia64, mips32, ppc32, sparc64) OS to play with, HelenOS is definitely worth looking at.

Saturday, May 21, 2011

Seen a really broken pipe

Have you seen a broken pipe? Sounds like a stupid question for everyone who is working with *NIX. Everyone who has English as a mother tongue, or is old enough to use a non-localized OS has seen a "broken pipe" message. A younger generation might have seen the message in their native language.

Well, that's not the sort of brokenness I'm talking about. I mean it more literally:

# echo "This pipe works fine" |cat
This pipe works fine
# echo "And this pipe is veeeeeeeeeeeeeeeeeeeery broken. This pipe is really very broken. Broken." | cat
Ths pip eisveTkebn

Oops. Had to spend a lot of time in Solaris ascending from the SCSI driver (at first I didn't realize that the bug appears in pipes, it looked like a DMA bug) to streams, pipe and so on and then descending to the memcpy which was the source of trouble. It turned out that memcpy uses different routines for small and large data chunks. The small chunks are copied word-wise in a loop and the large ones are copied using VIS instructions (that's the SPARC equivalent of Intel's MMX). The emulation of VIS instructions in qemu is buggy, so the memory gets corrupted when these instructions are used.
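
A toy model of why only long writes got mangled. Everything in it is hypothetical: the cut-over size, the function names and especially the way the broken block copy corrupts data are stand-ins, not what Solaris or qemu actually do.

```python
WORD = 4
BLOCK_THRESHOLD = 64  # hypothetical cut-over size, not taken from Solaris


def copy_words(src):
    """Small-chunk path: copy word by word in a plain loop."""
    out = bytearray()
    for i in range(0, len(src), WORD):
        out += src[i:i + WORD]
    return bytes(out)


def copy_blocks_ok(src):
    """Large-chunk path when the block instructions behave correctly."""
    return bytes(src)


def copy_blocks_broken(src):
    """Stand-in for a mis-emulated block copy: every other byte is lost."""
    return bytes(src[::2])


def memcpy_model(src, block_copy):
    """Pick the copy routine by size, as the memcpy described above does."""
    if len(src) < BLOCK_THRESHOLD:
        return copy_words(src)
    return block_copy(src)


short_msg = b"This pipe works fine"
long_msg = b"This pipe is really very broken. " * 3
print(memcpy_model(short_msg, copy_blocks_broken) == short_msg)  # True
print(memcpy_model(long_msg, copy_blocks_broken) == long_msg)    # False
```

The short message never reaches the block path, so it survives even with the broken routine plugged in, which is the same pattern as the two echo commands above.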

Once again I'm very impressed that Solaris 2.5.1 - Solaris 7 boot without problems on such broken hardware. Let's see if the newer Solaris versions would work with this emulation bug fixed.

Saturday, October 9, 2010

Bug in all Solaris versions after 5.7?!?

Tried to clean up some patches for submitting upstream, and it turned out that one of the hacks is needed because of a Solaris bug! I'm really astonished.
The init routine of the network card driver in Solaris 2.6 has this piece of code:

call      ddi_get_parent
ld        [%l0 + 0xc], %o0
call      ddi_get_driver_private
nop
add       %o0, 0x4, %g2
st        %g2, [%l0 + 0x720]

That's how it looks in Solaris 9:
call      ddi_get_parent
ld        [%l0 + 0x10], %o0
call      ddi_get_driver_private
nop
add       %o0, 0x10, %g2
st        %g2, [%l0 + 0x728]

Adding 0x10 to the base of the DMA registers produces a pointer to nowhere.

Yes, qemu is not precise, and doesn't emulate memory aliasing (Blue Swirl had a patch for it), but hey, Solaris works on sun4m only due to a coincidence!

So, all the Solaris versions from 5.7 to 9 can be booted in qemu by hot patching in kadb (booting kadb is already described in the how-to).

That's how I patched Solaris 9 for booting under qemu:
kadb[0]: le#leinit:b            set a deferred breakpoint
kadb[0]: :c                     continue execution
...
kadb[0]: leinit+0x654?i         check that we are at the correct place
               add       %o0, 0x10, %g2
kadb[0]: leinit+0x654/X
leinit+0x654:   84022010
kadb[0]: leinit+0x654/W 84022004
leinit+0x654:   0x84022010    =   0x84022004       patch
kadb[0]: leinit:d               delete the breakpoint
kadb[0]: :c                     continue
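
To double-check the two words used in the patch, here is a small sketch that decodes a SPARC V8 format-3 "add register, immediate" instruction. The function name is mine, and it deliberately handles only this one instruction form.

```python
# SPARC integer register names: r0-7 = %g, r8-15 = %o, r16-23 = %l, r24-31 = %i
REGS = [f"%{bank}{n}" for bank in "goli" for n in range(8)]


def decode_add_imm(word):
    """Decode 'add rs1, simm13, rd'; reject anything else."""
    op = word >> 30             # 2 = arithmetic/logical format
    rd = (word >> 25) & 0x1F    # destination register
    op3 = (word >> 19) & 0x3F   # 0 = add
    rs1 = (word >> 14) & 0x1F   # source register
    imm_flag = (word >> 13) & 1  # 1 = second operand is an immediate
    simm13 = word & 0x1FFF
    if op != 2 or op3 != 0 or imm_flag != 1:
        raise ValueError("not an 'add reg, imm, reg' instruction")
    if simm13 & 0x1000:         # sign-extend the 13-bit immediate
        simm13 -= 0x2000
    return f"add {REGS[rs1]}, {simm13:#x}, {REGS[rd]}"


print(decode_add_imm(0x84022010))  # add %o0, 0x10, %g2  (original)
print(decode_add_imm(0x84022004))  # add %o0, 0x4, %g2   (patched)
```

Only the low bits of the word change, so the patch swaps the immediate 0x10 for 0x4 and leaves the registers alone.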

Once again I can only recommend reading the PANIC! UNIX System Crash Dump Analysis Handbook to understand the basics before patching anything.

Sunday, August 15, 2010

Fixed the "Solaris Y2K10" bug

Got back to Solaris/qemu Y2K10 bug. The name turned out to be misleading because a) it's not a Solaris bug and b) it's not a Y2K10 bug.

The reason for the bug was someone mixing hexadecimal and decimal values. Gonna check if there are more such places and send the trivial fix later.

Saturday, July 10, 2010

Fixed the slavio timer bug properly

Fixed the timer bug found last week more correctly. Will submit the patch later.

Gosh, it is hot over here. Today is something like 38C (~100F) . The cpu in my head is refusing to send any patches and fix any further bugs till the temperature drops.

Saturday, July 3, 2010

Fixed another bug in the slavio timer emulation

Trying to get NeXTStep/sparc to boot without any success, I got back to the old bug which seemed to be related: some versions of OBP hang at boot waiting for the timer interrupts.

Somehow I got poisoned by the motto of the qemu developers: OBP doesn't work because the initial set-up is wrong. On the real hardware no one would expect BIOS to work if the machine doesn't pass the power-on self test (POST). But then it came to me that exactly this motto prevented other people from getting OBP working under qemu-system-sparc.

So, I went on and asked Mitch whether he thought his creation - OBP - was buggy and relied on the [probably missing] POST initialization. Mitch said that he's pretty sure that the OBP would do the right thing in this case, so I took another look at the qemu timer code, and fixed the bug.

The bug turned out to be unrelated to the NeXTStep boot problem. On the other hand the fix provides alternatives to the SPARCStation-5 emulation. Now it's possible to get the SPARCStation-10 firmware to work, which gives 512 MB to the guest.

Here come some boot logs with the OBP from SPARCStation-10 and LX.

Saturday, June 19, 2010

Yet another bug in IRQ emulation

Trying to find out why NetBSD versions 1.6-3.1 do not boot, I found a bug in IRQ processing. After fixing it these versions still don't boot, :) but fail more gracefully.
Looks like there is another couple of bugs to go...

Monday, May 31, 2010

Another week another SCSI bug

Fixed the Solaris 2.6+ boot which I accidentally broke last week. It's not that my Solaris 2.3 dma/irq fix was wrong, but the fix unleashed a corresponding interrupt handling bug in the esp controller.

Too bad that no one reported it earlier. I wouldn't have to hack till midnight now. ;-) And thanks to VooDoo_UzH_ for reporting it.

Saturday, May 22, 2010

1993 reached

The time machine is working! Well, I had to fix another bug to do it. This time in DMA again. The Solaris 2.3/sparc can be installed under qemu!

Submitted the patch upstream. After it gets accepted, it should be possible to use Solaris 2.5.1- instructions from the how-to to install Solaris 2.3. I wonder if the patch also improves the situation with NetBSD 5.x stability. Feel free to report.

Saturday, May 8, 2010

sent the SunOS 4.1.4 patch upstream

Sent the SunOS 4.1.4 patch upstream. If it gets accepted, it should be possible to get up to "The Wh" using Solaris 2.5.1- instructions from the how-to.

Found another bug in qemu which I can't fix. Hoped that others can do it, so posted it to the mailing list, but got no responses. Actually the bug may be related to the SunOS fix I just sent: maybe SunOS tries to access a non-connected address not because it's buggy, but because qemu translates the address wrong.

If no one from the mailing list answers I'll have to dig into it further.

Sunday, April 18, 2010

FPU bugs

Found more bugs. This time in FPU. One was very promptly fixed by Blue Swirl (the mysterious qemu-sparc maintainer). Another one is more tricky. qemu goes astray and doesn't stop at breakpoint, so it's gonna be hard to find out what exactly is going on.

Wednesday, April 7, 2010

a couple of bugs in OpenBIOS

Found two bugs in OpenBIOS. I think I fixed one of them, but it can be checked only when the other one gets fixed. The other one is pretty deep in the Forth engine, and I don't want to mess with it. Once it is fixed I'll get back to OpenBIOS.

Sunday, February 21, 2010

another week, another qemu bug again

Fixed one more CPU bug. Really surprised that it didn't affect Linux and Solaris versions prior to 2.6. I'd expect that lots of multi-threaded code relying on mutexes should have been affected...

Sunday, February 14, 2010

yet another dma bug

Fixed another bug in qemu sparc32 dma. This one seems to be only relevant for Linux guests though.

Sunday, February 7, 2010

another week, another qemu bug

There are very few qemu/sparc modules out there which I haven't had to touch. Since I started I have found/fixed bugs in: irq, esp, cpu, scsi-disk, fdd, tcx, mmu, slavio. Today this list is extended with (sparc32_)dma.

Fixed a bug in dma which produced spurious interrupts and incomplete reads/writes. Will submit the patch later on this week.

Sunday, October 11, 2009

The second bug in the qemu sparc CPU emulation

Mitch Bradley found a bug in the SPARC CPU emulation. I gave him access to my qemu session and he stepped through the code. It's sort of a shame I haven't done it myself, as I thought about it 2 weeks ago.

This bug is actually much more serious than the previous one. While the previous one affected only hand-crafted assembly code, this one should hit compiled code as well: the handling of the carry flag in the subxcc instruction is wrong. And, yes, it's a RISC architecture, so this instruction is also used for comparison...
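
To show why a wrong carry in subxcc breaks comparisons, here is a small model of subcc/subxcc as I understand them from the SPARC V8 definition (for subtraction, carry means borrow). It is a sketch for illustration, not the qemu code, and the function names are mine.

```python
MASK32 = 0xFFFFFFFF


def sub_with_carry_cc(rs1, rs2, carry_in):
    """Model of subcc (carry_in=0) and subxcc on 32-bit unsigned operands."""
    full = rs1 - rs2 - carry_in
    result = full & MASK32
    flags = {
        "N": result >> 31,                          # sign bit of the result
        "Z": int(result == 0),
        "V": ((rs1 ^ rs2) & (rs1 ^ result)) >> 31,  # signed overflow
        "C": int(full < 0),                         # borrow out of bit 31
    }
    return result, flags


def less_than_u64(a, b):
    """Unsigned 64-bit compare built from a subcc/subxcc pair."""
    _, flags = sub_with_carry_cc(a & MASK32, b & MASK32, 0)       # low words
    _, flags = sub_with_carry_cc(a >> 32, b >> 32, flags["C"])    # high words
    return bool(flags["C"])


print(less_than_u64(0x1_0000_0001, 0x1_0000_0002))  # True
print(less_than_u64(0x2_0000_0000, 0x1_0000_0001))  # False
```

If the second step computes C incorrectly, a 64-bit comparison compiled into such a pair gives the wrong answer, which would explain how the bug reaches ordinary compiled code.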

I'm really astonished that Linux/sparc has been working under qemu for years. Of course Linux may be just more robust, but it may also mean that gcc doesn't use some sparcv8 instructions, and is therefore inefficient.