Pete Zaitcev's Journal
Pete Zaitcev


BadName is essentially conquered [02 May 2008|03:01am]

The issue with random applications failing to start (Firefox, Nautilus) or blowing up (panel, gvim) with BadName took me about 3 months to find (the bug was filed at the end of January). I'm not sure if my fix is any good, need to poke Ajax about it.

So... Wasted a lot of time, learned several mildly interesting things about the code and people involved.

The sad part is how much it takes to start moving around any modern codebase, and that's with the same language and toolchain. I remember times when no part of the system was off-limits, but these days... not so much. If anything breaks in OpenOffice, I'm not even going to try fixing it.


Ted Ts'o on [Open]Solaris [23 Apr 2008|11:00am]

Ted suddenly decided to talk OpenSolaris. Pretty interesting... at least for me, since I spent the 7 best years of my life in Sun's orbit.

In passing, aside from the bulk of the post, it seems to me that the final argument, about competitors selling Solaris support, does not hold water. This is exactly what Oracle attempted with their clone of CentOS and they weren't very successful, despite having a strong Linux team under Wim.

Other than that, he's probably right. But he's going to get responses. Whenever I mention Solaris (last time it was when I linked to Jeff Bonwick's blog), I get the most inane responses from Solaris fanboys. It looks like a very vocal community of users, if not contributors. Sounds like Apple almost.

This puts the damper on any dreams I may have about re-living the glory of my youth by getting back to hacking on that codebase.

UPDATE: Not sure why Levon decided to post his reply to his personal blog instead of the one at Sun. Surely the other one is more relevant?


Random dmesg errors [19 Apr 2008|12:18pm]

I was always against the kernel spewing user-generated errors into dmesg, like this:

npviewer.bin[4393]: segfault at f6712030 ip 67e7a0 sp ff9c39ec error 4 in libpthread-2.8.so[677000+15000]

Not helpful, not interesting.
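For the record, the fields of that line can at least be pulled apart mechanically. Below is a minimal Python sketch, assuming the usual x86 page-fault error code layout (bit 0: protection fault vs. page not present, bit 1: write vs. read, bit 2: user vs. kernel mode) and that the error field is printed in hex. The function and key names are mine, purely for illustration.

```python
import re

# Matches the "segfault at" line format quoted in the entry above.
SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at (?P<addr>[0-9a-f]+) "
    r"ip (?P<ip>[0-9a-f]+) sp (?P<sp>[0-9a-f]+) error (?P<err>[0-9a-f]+) "
    r"in (?P<obj>\S+)\[(?P<base>[0-9a-f]+)\+(?P<size>[0-9a-f]+)\]"
)

def decode_segfault(line):
    """Return a dict describing the fault, or None if the line does not match."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    err = int(m.group("err"), 16)
    ip = int(m.group("ip"), 16)
    base = int(m.group("base"), 16)
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "object": m.group("obj"),
        # Offset of the faulting instruction inside the mapped object.
        "offset": ip - base,
        # Assumed x86 error code bits, see the lead-in.
        "protection_fault": bool(err & 1),
        "access": "write" if err & 2 else "read",
        "user_mode": bool(err & 4),
    }
```

The offset is what one would feed to a disassembler or addr2line against the library, which is about the only use I can think of for this line.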

However, the other day my desktop keeled over in a strange way... The /var/log/messages contained this (followed by a stack trace):

Apr 13 18:19:14 niphredil kernel: Xorg: page allocation failure. order:3, mode:0x4020

It looks like a bug in SLUB (though it does not seem to be registering with anyone who has the power to track it down). But my point is, without the printout I would need to find what was happening by other means, and that would probably take forever.
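To put a number on that message: the order is the base-2 logarithm of the number of contiguous pages requested. A small Python sketch, assuming 4 KiB pages (true on x86, not necessarily elsewhere); the helper name is made up, and the mode value is passed through undecoded.

```python
import re

PAGE_SIZE = 4096  # assumption: 4 KiB pages, as on x86

ALLOC_FAIL_RE = re.compile(
    r"(\S+): page allocation failure\. order:(\d+), mode:(0x[0-9a-f]+)"
)

def parse_alloc_failure(line):
    """Return (comm, order, bytes_requested, mode) or None if no match."""
    m = ALLOC_FAIL_RE.search(line)
    if m is None:
        return None
    order = int(m.group(2))
    # An order-N allocation asks for 2**N physically contiguous pages.
    return m.group(1), order, PAGE_SIZE << order, int(m.group(3), 16)
```

For the line above this works out to 8 contiguous pages, or 32 KiB under the 4 KiB assumption.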

Hmm... My world is shaken.

P.S. kgdb was merged into 2.6.26. The sky is falling.


Jon Corbet on Red Hat and Desktop [17 Apr 2008|10:11am]

Seen at LWN today (no permalink — what the heck?):

Red Hat's desktop team has posted an item saying that the company has no plans to offer a "traditional desktop product" anytime soon.

Say what? The referenced item says:

[W]e have no plans to create a traditional desktop product for the consumer market in the foreseeable future.

Umm... RHEL desktop is doing quite well; all we're saying is that we're not committed to selling it at Best Buy. Not sure how this debacle happened. Jon was probably short on coffee.


ipv6.google.com [16 Apr 2008|09:22am]

If the client is logged in, Google bounces to the old site. In order to access it over IPv6, you have to log out and use http://ipv6.google.com/webhp. Apparently, not quite there yet.

One funny thing, using Google while logged out is much faster. Apparently it takes time for them to act upon cookies my client sends. Remind me again how they goaded everyone into this "homepage" thing. Ah yes, Gmail.


Fallback-induced thoughts [15 Apr 2008|02:16pm]

I saw two or three bug filings in the last couple of months which deal with a USB device not working until ehci_hcd is unloaded. Thinking sensibly, it's rather normal: a poorly-made or poorly-cabled device may choose to report High (480) speed yet be unable to communicate at that speed. And a couple of devices failing across half a million users is rare. However, the thing is, such cases were extremely rare before; I don't even remember the last time this happened. So, I'm starting to worry that EHCI hardware or software may have a subtle bug somewhere (perhaps specific silicon percolated to the field).

If only there was a way to tap into Novell's bugzilla and watch their kernel bugs, to collate with ours. Ditto Bligh's Bugme and Ubuntu's whatever (Launchpad?).

For readily identifiable bugs, we just report them to linux-usb or whatever and then patterns just come together, but the problem of fallback-wannabe devices is too flimsy and vague.

P.S. By "fallback" I mean the new code which switches a port over to Full (12) speed if enumeration fails. It's a practical solution, but it seems like sweeping the problem under the carpet to me. Also, it won't work for anything that's plugged into a hub.

UPDATE: Amit from Ubuntu pointed to their bug 88746. V.interesting.


Unpleasant mass updates [14 Apr 2008|03:25pm]

Mass updates to Bugzilla have a few unpleasant side effects:

  • Unless they're done by DKL with direct access to the database, they generate a lot of e-mail which buries actual updates.
  • They destroy the usability of queries for "bugs modified in the last 60/45/30 days". It's a useful trick I learned from Arjan. But now all kernel bugs are recently modified.

The idea, I guess, is that a developer has to rescan relevant bugs and either work on them, push them into NEEDINFO, or close them. If hackers are diligent about it, the auto-closer is harmless [and also, unnecessary -- ed]. In reality though, it just does not work that way [and the very existence of the auto-closer is the proof -- ed]. At a certain point, I started making extra-Bugzilla lists of bugs which look realistic to work on (e.g. have an active submitter who cooperates, for one thing). The rest just rots. I don't even have cycles to push WONTFIX on them (or, actually, I have time to close, but I don't want to deal with the fallout, so I just pretend not to see them -- the task made easier by the mass-update and the resulting mail avalanche).

P.S. My list of bugs is, like, 10 to 50 times smaller than Chuck's and DaveJ's. I don't understand how they cope. It seems impossible to me, so there must be some trade secret good kernel monkeys know.


The Belgian paper [09 Apr 2008|08:14am]

Completely useless. Gee, the thugs running worst shitholes of the world can forge documents signed by children and make all their Web access trackable and non-refutable. Dog bites man. We knew before this paper that they condition children to carry their own telescreens. The only thing I want to know about the BitFrost is how to defeat it, and the paper doesn't say. Useless.


-e for Elimination [07 Apr 2008|12:39pm]

After noticing that the annoying and useless (for me) orange star has no setting "go away forever", I concluded it was time to use "rpm -e". However, we have another case of House That Jack Built: system-config-printer needs /usr/bin/system-install-packages (because, you know, it wants to pull printer drivers for you automagically). But the system-install-packages is a part of the gnome front-end, not PackageKit itself (why? a mystery), and that includes the orange star (it's called pk-update-icon). Godly.

At least it's not a throbbing red eye, and not restarted when killed. Also, Yet Another Sneaky Daemon They Sprung On My System While I Looked The Other Way (packagekitd) quietly disappears after a while, releasing my precious memory. I sense some good intent here, but it's not good enough.

P.S. I'm testing if "Check for Updates: Never" and then killing means "go away".

P.P.S. Nope, it still restarts on the next login, and checks for updates. What part of "never" is unclear here?


Timeouts [06 Apr 2008|08:11pm]

I didn't try to burn a CD with ub in a while, because my new laptop comes with a built-in burner. After all the hustling with __blk_end_request, I thought the situation called for a test. This looked worrisome:

Track 01: Total bytes read/written: 548321280/548321280 (267735 sectors).
Errno: 5 (Input/output error), close track/session scsi sendcmd: cmd timeout after 5.000 (480) s
CDB:  5B 00 02 00 00 00 00 00 00 00
cmd finished after 5.000s timeout 480s
cmd finished after 5.000s timeout 480s
wodim: Cannot fixate disk.

The resulting CD was not a coaster though. A welcome surprise, but clearly I did something wrong regarding timeouts, and it needs fixing (although I'm quite sure that there's no other person on Earth who would want to burn CDs with ub).
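For anyone squinting at that output, here is a rough Python sketch that decodes the two interesting bits: the CDB and the timing line. It assumes the MMC layout of CLOSE TRACK/SESSION as I remember it (opcode 5Bh, IMMED in bit 0 of byte 1, close function in the low three bits of byte 2, track number in bytes 4-5). The helper names are hypothetical and this is no substitute for the spec.

```python
import re

# Assumed MMC close-function codes; anything else is reported as "other".
CLOSE_FUNCTIONS = {0b001: "close track", 0b010: "close session"}

def decode_close_cdb(text):
    """Decode a hex-dumped CDB if it is CLOSE TRACK/SESSION, else return None."""
    cdb = bytes(int(b, 16) for b in text.split())
    if not cdb or cdb[0] != 0x5B:
        return None
    return {
        "immed": bool(cdb[1] & 0x01),
        "function": CLOSE_FUNCTIONS.get(cdb[2] & 0x07, "other"),
        "track": int.from_bytes(cdb[4:6], "big"),
    }

def parse_timing(line):
    """Return (seconds_elapsed, timeout_requested) from a 'cmd finished' line."""
    m = re.search(r"cmd finished after ([\d.]+)s timeout (\d+)s", line)
    if m is None:
        return None
    return float(m.group(1)), int(m.group(2))
```

Read this way, the command was a non-immediate close session that came back after 5 seconds although wodim had asked for 480, which is consistent with a shorter timeout being applied somewhere below wodim.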

BTW, the new cdrecord looks nice indeed. Before, I only used the one maintained by that self-centered dude with attitude... No idea who maintains this one, but it seems to be working OK.


What Would Rusty Say? [21 Mar 2008|09:32pm]

One of the many great things Rusty has done was introducing the Misuse Levels of APIs (in OLS 03 keynote, slide 30 and beyond). I had a run-in with something of that nature last week.

Here's an interface:

/**
 * blk_end_request - Helper function for drivers to complete the request.
 * @rq:       the request being processed
 * @error:    0 for success, < 0 for error
 * @nr_bytes: number of bytes to complete
 *
 * Description:
 *     Ends I/O on a number of bytes attached to @rq.
 *     If @rq has leftover, sets it up for the next range of segments.
 *
 * Return:
 *     0 - we are done with this request
 *     1 - still buffers pending for this request
 **/
int blk_end_request(struct request *rq, int error, unsigned int nr_bytes)

What do you think the "number of bytes to complete" is? It seemed natural to me that it's the number of bytes which was transferred (and thus, it can be smaller than the number of bytes remembered in the request). This is how I would design an API. But in this case, nr_bytes is the number of bytes which was in the request initially. As such, it is greater than the request->data_len, which drivers modify to indicate the residue.

I think this has something to do with Tomo's & Jens' desire to avoid modifying drivers which poke ->data_len today (indeed, the code doing so in ub remained unchanged). If so, the price is too steep, IMHO.

Curiously, the designers of the API themselves misused it when they converted ub. They called __blk_end_request() with an argument of blk_rq_bytes(rq), but since ub modifies ->data_len, it guaranteed a failure for packet requests.
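The trap is easier to see in a toy model. The Python sketch below is not kernel code: ToyRequest and toy_end_request are hypothetical stand-ins that only mimic the documented return values (0 when done, 1 when buffers are still pending), and the assumption that the byte count for a packet request is taken from data_len is my reading of the failure described above.

```python
class ToyRequest:
    """Toy stand-in for struct request; purely illustrative."""
    def __init__(self, total_bytes):
        self.pending = total_bytes    # bytes still attached to the request
        self.data_len = total_bytes   # field the driver may overwrite

def toy_end_request(rq, nr_bytes):
    """Complete nr_bytes; return 0 if the request is done, 1 if bytes remain."""
    rq.pending -= min(nr_bytes, rq.pending)
    return 1 if rq.pending else 0

def complete_with_data_len(total, transferred):
    """The misuse: the driver stores the residue in data_len, then passes it on."""
    rq = ToyRequest(total)
    rq.data_len = total - transferred
    return toy_end_request(rq, rq.data_len)

def complete_with_original_length(total, transferred):
    """What the API wants: the byte count the request started with."""
    rq = ToyRequest(total)
    original = rq.data_len
    rq.data_len = total - transferred
    return toy_end_request(rq, original)
```

In this model a fully successful 4096-byte transfer leaves a residue of 0, so the first variant completes nothing and the request stays pending forever, while the second variant finishes it regardless of the residue.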

Everything seems to be working now, but I suspect that 2.6.25 is going to ship with a broken ub (thank Chris Wright for the Stable Tree).

UPDATE: See also a blog article (same server, but helps if Rusty decides to reshuffle his home directory).


Irony of the day [20 Mar 2008|09:55pm]

Seen today:

  PID USER      PR  NI  VIRT  RES  S %CPU %MEM    TIME+  COMMAND
23015 root      20   0  311m  63m  S  1.0  7.4 148:50.50 Xorg   
23661 zaitcev   20   0  446m 9324  S  1.0  1.1 117:00.18 gnome-power-man

So, the process which is supposed to save my CPU cycles is responsible for consuming almost as much as X server, which does a heck of a lot of work. Isn't it ironic?

I suspect Gnome Power Manager loses its mind when I close the lid overnight.


The LJ dorama [18 Mar 2008|04:15pm]
I welcome the deprecation of "free" LJ accounts. I always found it unfair and unsustainable how paying users subsidized unpaid users. As for Brad, being a former honcho gives him some insight points, but not too many. Look at Craig, does he blog about dumb people turning his website into something he does not like?

When release is not [18 Mar 2008|12:19am]
Spent a weekend on a bug which looked like some interesting race, but in fact was just a simple logic error. The result is a trivial one-line patch; I'm two days older and not an iota smarter.

BTW, our input subsystem is really convoluted. I'm not surprised bugs like this happen.

LOL X11 [05 Mar 2008|05:00pm]

A minute ago I pulled a mouse out of my laptop and the X server crashed. The last messages in Xorg.0.log were:

(EE) Read error: No such device (19, -1 != 24)
(II) Microsoft Microsoft USB Wireless Mouse: Off

This happened because Xorg server 1.4.99.1 force-loads evdev contrary to my xorg.conf, and we all know that evdev is a crash city.

It's a good thing I was born and raised back when this sort of thing was expected every day, so X cannot catch me with a half-entered bug report in a browser. Also, it seems that Vim finally learned to remove dead .swp files without annoying users.


GNOME does something right [12 Feb 2008|01:19pm]

Today I plugged my iPod in, and Rawhide launched Rhythmbox, which chugged along for about half a minute, then crashed. At least it did not wipe the player's database. WTF, I thought I had all auto-run types set to off...

Looks like there were some changes. No more unchecking umpteenth type of crashware.

To compensate for a good thing, they worked extra hard to hide this panel. It's not reachable from any part of the System menu. File Browser has to be started, and its Preferences adjusted.

UPDATE: The last part is untrue, I missed the right item. Thanks ucc_journal for the correction.


GNOME lies to me [08 Feb 2008|09:35am]

Every time I log in, it promises not to show the message again, but then does.

I suppose I should clean some keys in gconf-editor, but which?

Also, tinkering with the registry is so Windows.


Centralized git at Xorg [03 Feb 2008|10:48am]

One side effect of the multiple-repo organization of Xorg is that I cannot clone a local repository without editing git_xorg.sh. In the kernel, I usually have just one repository which tracks Linus (linux-2.6), then clone and blow away repositories as needed (linux-2.6-ub, linux-2.6.24-rc7, linux-2.6.23-253424). In X, git_xorg.sh has a variable on top which encodes the parent, which is not so bad. But still... Obviously they just do it differently, but how? Keith was saying something about extensive use of branches, so maybe that's it.

Also, since we're on topic, git_xorg.sh itself is not in the git. Now that's really odd, because how do I know if it's changed? I don't even remember now whence I downloaded it.

P.S. Another thing, before blowing away a repo, I would look quickly with "git diff" if anything interesting was left in it. Needless to say, this is impossible in Xorg, so it's just more evidence that they never clone anything.


Have libsilc problem? [29 Jan 2008|09:39pm]

On one box, every run of yum update was complaining about something unavailable around libsilc... I put it down to mirror inconsistency and worked around it with --exclude. Today I decided to look into it more closely, and saw the following:

[root@simbelmyne zaitcev]# rpm -q pidgin fedora-release
pidgin-2.0.0-0.34.beta7devel.fc7.x86_64
fedora-release-8.90-3.noarch
[root@simbelmyne zaitcev]# yum update pidgin
Setting up Update Process
Could not find update match for pidgin
No Packages marked for Update
[root@simbelmyne zaitcev]# 

An F7 package has no update when we're building F9? [*]

[root@simbelmyne zaitcev]# rpm -q --queryformat "%{epoch}\\n" pidgin
2
[root@simbelmyne zaitcev]# 

Suddenly the problem came into focus. The worst part of it, current pidgin is version 2.3.1, which is greater than 2.0.0. They could've kept epoch forever and nobody would've noticed... The changelog explains:

* Sat Apr 21 2007 Warren Togami <wtogami@redhat.com> 2.0.0-0.35.beta7devel
- upstream insists that we remove the Epoch
  rawhide users might need to use --oldpackage once to upgrade
- remove mono and howl cruft

* Wed Jul 12 2006 Jesse Keating <jkeating@redhat.com> 2:2.0.0-0.6.beta3.1
- rebuild

Hmm... Not sure why "upstream" cares, but whatever.
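Why the stale Epoch blocks the update can be shown with a simplified comparison. The Python sketch below is not rpm's actual version comparison routine, only an approximation that captures the part that matters here: Epoch is compared first, and a missing Epoch is treated as 0. The candidate's release string "1.fc9" is made up for the example.

```python
import re
from itertools import zip_longest

def _segments(s):
    # Split into runs of digits or letters; separators are dropped.
    return re.findall(r"\d+|[a-zA-Z]+", s)

def _cmp_str(a, b):
    """Simplified version/release comparison: returns -1, 0 or 1."""
    for x, y in zip_longest(_segments(a), _segments(b)):
        if x is None:
            return -1
        if y is None:
            return 1
        if x.isdigit() and y.isdigit():
            x, y = int(x), int(y)
        elif x.isdigit() != y.isdigit():
            # In this sketch a numeric segment beats an alphabetic one.
            return 1 if x.isdigit() else -1
        if x != y:
            return 1 if x > y else -1
    return 0

def compare_evr(a, b):
    """Compare (epoch, version, release) tuples; epoch None counts as 0."""
    ea, eb = a[0] or 0, b[0] or 0
    if ea != eb:
        return 1 if ea > eb else -1
    return _cmp_str(a[1], b[1]) or _cmp_str(a[2], b[2])
```

With these rules (2, "2.0.0", "0.34.beta7devel.fc7") compares as newer than (None, "2.3.1", "1.fc9"), so the depsolver never sees an update, and dropping the Epoch from the installed side flips the result.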

Just for fun I ran rpm|grep|wc and found a ton of packages with non-empty Epoch. Epoch 0: 108 packages, Epoch 1: 79, other values: 52. The biggest number belongs to aspell-en: Epoch 50! The changelog is:

* Wed Aug 11 2004 Adrian Havill <havill@redhat.com> 50:0.51-9
- sync epoch with other aspell dicts, upgrade to 0.51-1

Now that must be a pretty funny story, but I don't want to know.
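The same head count can be done without the grep|wc chain. A minimal Python sketch, assuming the input is the output of the same %{epoch} query format used above, run over all installed packages, and that rpm prints "(none)" for a package without an Epoch; the bucket names are mine.

```python
from collections import Counter

def epoch_histogram(query_output):
    """Bucket one-epoch-per-line text into none / 0 / 1 / other."""
    counts = Counter()
    for line in query_output.splitlines():
        value = line.strip()
        if not value:
            continue
        if value == "(none)":
            counts["none"] += 1
        elif value in ("0", "1"):
            counts[value] += 1
        else:
            counts["other"] += 1
    return dict(counts)
```

Feeding it the text captured from rpm -qa with that query format (for example via subprocess.run) should give the same three numbers as the pipeline, plus a count of packages with no Epoch at all.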

[*] Actually we have a few packages from F7 which weren't rebuilt across the whole Fedora 8 cycle, for example grep.


Today, I hate... who? [29 Jan 2008|01:40pm]

Not sure who to hate today. At first I was going to hate Karsten, but I quickly realized that he's actually on the good side, fixing problems rather than causing them. But someone has to be responsible for the abomination known as libtool, right?

It all started when I wanted to run Hercules. I extracted my old images and hercules.cnf, ran "yum install hercules", and then... Hercules starts, but recognizes no devices, starting with the 3505.

The problem is in the so-called "dynamic load": emulators for devices are shared objects, and our stock Hercules on Fedora searches for them everywhere except where necessary:

write(4, "HHCCF065I Hercules: tid=2AD76E5C"..., 72) = 72
open("/hercules/hdt3505.la", O_RDONLY)  = -1 ENOENT
open("/hercules/hdt3505", O_RDONLY)     = -1 ENOENT
open("/lib/hdt3505.la", O_RDONLY)       = -1 ENOENT
open("/usr/lib/hdt3505.la", O_RDONLY)   = -1 ENOENT
open("hdt3505.la", O_RDONLY)            = -1 ENOENT
access("/lib/hdt3505", R_OK)            = -1 ENOENT
access("/usr/lib/hdt3505", R_OK)        = -1 ENOENT
open("/etc/ld.so.cache", O_RDONLY)      = 10
fstat(10, {st_mode=S_IFREG|0644, st_size=84779, ...}) = 0
mmap(NULL, 84779, PROT_READ, MAP_PRIVATE, 10, 0) = 0x2aaaaf55b000
close(10)                               = 0
open("/lib64/tls/hdt3505", O_RDONLY)    = -1 ENOENT
open("/lib64/hdt3505", O_RDONLY)        = -1 ENOENT
open("/usr/lib64/tls/hdt3505", O_RDONLY) = -1 ENOENT
open("/usr/lib64/hdt3505", O_RDONLY)    = -1 ENOENT
munmap(0x2aaaaf55b000, 84779)           = 0
write(4, "HHCCF042E Device type 3505 not r"..., 42) = 42

The real path is /usr/lib64/hercules/hdt3505.so. Did nobody ever test?!
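Eyeballing strace output gets old quickly, so here is a small Python sketch that collects every path handed to open() or access() and checks whether the real module location was ever among them. It assumes strace lines shaped like the ones above; the function name is hypothetical.

```python
import re

# Matches the first quoted argument of open(...) or access(...) in strace output.
STRACE_PATH_RE = re.compile(r'^(?:open|access)\("([^"]+)"')

def paths_tried(strace_text):
    """Return, in order, the paths that open()/access() were called with."""
    paths = []
    for line in strace_text.splitlines():
        m = STRACE_PATH_RE.match(line.strip())
        if m:
            paths.append(m.group(1))
    return paths

def was_tried(strace_text, wanted):
    """True if the wanted path shows up in any open()/access() call."""
    return wanted in paths_tried(strace_text)
```

Run over the trace above, was_tried(trace, "/usr/lib64/hercules/hdt3505.so") comes back False, which is the whole bug in one line.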

So I download the source, configure with --disable-dynamic-load, build, everything works. After all, the whole /usr/lib64/hercules is only 240KB, who needs dynamic modules anyway? Then, I want to be good and try to build an RPM... It bombs with "ld: undefined hdl_genhdl".

It gets even more involved. Apparently, when I run configure by hand, libtool fails, falls back to linking from .a, and then everything works. But when I build an RPM, libtool succeeds, produces .so, then linking fails because...

So, spent a day trying to understand how libtool worked and why removal of rpath causes it to produce garbage, etc. until it was time to sleep. Today, filed a bug, moved on. Let Matthias puzzle it out.

UPDATE 20080103: Hans de Goede fixed it.

