September 11th, 2004

(no subject)

When Arjan enabled ub in Rawhide, it triggered a number of bug reports. I was actually surprised to learn how popular Rawhide is, considering that Fedora itself does not have all that long an update cycle. Quite a number of people were upset and bitched about the demise of Red Hat Linux in a box and the short life cycle of Fedora releases. But getting back on track, the two main problems with ub were an Oops when a mounted stick was unplugged, and hald going crazy. Actually, it didn't go crazy, but it filled /var/log/messages with block level errors. It seems that hald was opening all devices periodically, and since we now have udev, it was able to open ub even though nobody ran mknod to create device nodes. At this level of automation I'm going to get hit in the nads by an automatic toilet seat very, very soon.

The hald problem received more attention, but actually it was an easy one. I should have sent a fix a couple of weeks ago, but all this time I hoped to fix the Oops RSN and send both. The Oops was caused by an attempt to call put_disk from ub_disconnect(), which pulled the rug from under the upper layers. My first reaction was to move it to ub_cleanup, which is a refcounted tail destructor, similar to scsi_disk_put. However, I immediately started to doubt that it was sufficiently bulletproof. ub_cleanup is called from the release method, right? So the release is not finished at that point, and the structures are still in place. How is doing put_disk there any better than doing it from a disconnect? The whole construct rides on an assumption that nobody tries to do anything to the device (like an open) between the return of the release method and the teardown of all structures in the block device layer. And I just did not trust that to work and be race free.
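To make the "refcounted tail destructor" idea concrete, here is a toy userspace sketch. It is not the ub code: every name in it (toy_dev, toy_put, toy_cleanup and so on) is made up for illustration, and the counter bump merely stands in for put_disk. It assumes one reference held by the USB binding plus one per open, and it is single-threaded, so it says nothing about the race I was worried about.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical stand-in for a gendisk. */
struct toy_disk {
    long capacity;
};

/* Hypothetical stand-in for the driver's per-device structure. */
struct toy_dev {
    int refcnt;             /* one ref for the USB binding, one per open */
    bool gone;              /* set on disconnect; new opens are refused */
    struct toy_disk *disk;
};

static int toy_disks_released;  /* lets the demo observe the teardown */

/* Tail destructor: runs only when the last reference goes away. */
static void toy_cleanup(struct toy_dev *dev)
{
    toy_disks_released++;   /* stands in for put_disk() */
    free(dev->disk);
    free(dev);
}

static void toy_put(struct toy_dev *dev)
{
    assert(dev->refcnt > 0);
    if (--dev->refcnt == 0)
        toy_cleanup(dev);
}

static struct toy_dev *toy_probe(void)
{
    struct toy_dev *dev = calloc(1, sizeof(*dev));
    if (!dev)
        return NULL;
    dev->disk = calloc(1, sizeof(*dev->disk));
    if (!dev->disk) {
        free(dev);
        return NULL;
    }
    dev->refcnt = 1;        /* held by the binding until disconnect */
    return dev;
}

static int toy_open(struct toy_dev *dev)
{
    if (dev->gone)
        return -1;          /* hardware is gone, refuse */
    dev->refcnt++;
    return 0;
}

static void toy_release(struct toy_dev *dev)
{
    toy_put(dev);
}

/* Disconnect only marks the device and drops the binding's reference. */
static void toy_disconnect(struct toy_dev *dev)
{
    dev->gone = true;
    toy_put(dev);
}

/* Unplug a "mounted" stick; returns 1 if teardown waited for release. */
int demo_unplug_while_mounted(void)
{
    int before = toy_disks_released;
    struct toy_dev *dev = toy_probe();

    if (!dev || toy_open(dev) != 0)
        return -1;
    toy_disconnect(dev);                /* stick pulled while open */
    if (toy_disks_released != before)
        return 0;                       /* disk vanished too early */
    if (toy_open(dev) == 0)
        return 0;                       /* new open should be refused */
    toy_release(dev);                   /* last user leaves */
    return toy_disks_released == before + 1;
}
```

Calling demo_unplug_while_mounted() walks through the unplug-while-mounted case: the disk survives the disconnect and goes away only when the last opener releases it. What the sketch cannot show is the window between the release method returning and the block layer finishing its own teardown.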

Instead of doing the obvious, I sat and scoped an implementation where the disk was never released, ever. Only its geometry changed when media was pulled or devices were disconnected. Rock solid. However, the result was messy. First, I had to create states for the whole device, with a lot of confusion over what states are needed and what to do with them. Second, initialization had to be refactored into a sendmail-ish mess, because now sometimes I received a fresh ub_dev with no disk, and sometimes a previously owned one with a disk preallocated. And third, I had to copy the code to force a partition rescan from dasd.c. When I did that I understood that I had gone too far, quickly moved put_disk to ub_cleanup and sent a patch to Greg.

The result was two weeks of angry Rawhide users. Sometimes I wonder what is wrong with me. Maybe I should start a second career in real estate.